VOC meetings in the period 1998-2003

  • VOC autumn meeting 2006
  • VOC spring meeting 2006
  • VOC autumn meeting 2005
  • VOC spring meeting 2005
  • VOC autumn meeting 2004
  • VOC spring meeting 2004
  • VOC autumn meeting 2003
  • VOC spring meeting 2003
  • VOC autumn meeting 2002
  • VOC spring meeting 2002
  • VOC autumn meeting 2001
  • VOC spring meeting 2001
  • VOC autumn meeting 2000
  • VOC spring meeting 2000
  • VOC anniversary conference 1999
  • VOC spring meeting 1999
  • VOC autumn meeting 1998
  • VOC spring meeting 1998

  • Autumn meeting 2006

    The VOC autumn meeting will take place on Friday 3 November in Leiden. The theme of the programme is 'Changes over time'. Paul van Geert (Rijksuniversiteit Groningen) will speak about dynamic systems models in developmental psychology, and Geert Verbeke (Katholieke Universiteit Leuven) about linear mixed models for longitudinal data. The other speakers are Cees Elzinga (Vrije Universiteit), Ellen Hamaker (Universiteit Utrecht), Reinoud Stoel (Universiteit van Amsterdam) and Mark de Rooij (Universiteit Leiden). The programme is as follows:

    10.00 COFFEE
    10.30 Geert Verbeke Predicting renal graft failure using multivariate longitudinal profiles
    11.30 Reinoud Stoel To be, or not to be (on the boundary of the parameter space)
    12.05 LUNCH
    13.15 Ellen Hamaker Time series analysis in psychological research
    13.50 Cees Elzinga Metric representations of categorical time series
    14.25 Mark de Rooij Gravitational models for the analysis of change
    15.00 TEA
    15.30 Paul van Geert Dynamic systems approaches to long-term change and development: principles, models and analysis of data
    16.30 DRINKS

    Location: The meeting will be held in Room CH05 of the temporary building known as "the chalet". You can reach this temporary building by walking from the Pieter de la Court building across the car park. Directions.

    Admission to the meeting is free. Lunch costs 10 euros, to be paid on site. You can register until Monday 30 October with Marieke Timmerman (m.e.timmerman@rug.nl). If you do NOT wish to have lunch, please indicate this when registering.

    ABSTRACTS

    Geert Verbeke (Biostatistical Centre, K.U.Leuven)

    Predicting renal graft failure using multivariate longitudinal profiles.

    In many medical studies repeatedly measured biomarker information is gathered together with a time to an event, e.g. occurrence of a disease. In such situations, the biomarker information serves as a health indicator representing the progression of the disease, and can therefore be used to predict the event of interest. The application motivating this presentation considers patients who received a kidney transplant and who are intensively monitored during the years after the transplant. The time intervals between subsequent clinic visits are different within and between the patients. The event of interest is graft failure from chronic rejection or recurrent disease within 10 years after the transplantation. Markers used to predict failure are serum creatinine, urine proteinuria, mean of systolic and diastolic blood pressure, and blood haematocrit level. Our aim is to construct a model that allows prediction of graft failure based on all available information, i.e., all repeated measurements of all 4 markers. Furthermore, it is of interest to investigate how the multivariate information outperforms the information in each marker separately, when it comes to predicting the event of interest.

    The proposed approach starts from a linear, a generalised linear or a nonlinear mixed model for each of the markers separately. These models are then joined into one multivariate longitudinal model by specifying a joint distribution for all random effects. Due to the high number of markers, a pairwise model fitting approach, where all possible pairs of bivariate mixed models are fitted, is used. Afterwards, the fitted models are used in Bayes rule to obtain, at each point in time, the prognosis for long-term success.

    Geert Verbeke is professor of biostatistics at the Biostatistical Centre of the Katholieke Universiteit Leuven in Belgium. His research interests are in various aspects of models for longitudinal data, with particular emphasis on mixed models. He is Past President of the Belgian Region of the International Biometric Society, served as International Program Chair for the International Biometric Conference in Montreal (2006), and is currently Joint Editor of the Journal of the Royal Statistical Society, Series A (2005-2008). He served as Associate Editor for various statistical journals, including Biometrics and Applied Statistics.

    Reinoud Stoel (University of Amsterdam)

    To be, or not to be (on the boundary of the parameter space)

    In this presentation, I will illustrate how the use of inequality constraints on parameters in structural equation models may affect the distribution of the likelihood ratio test (Stoel et al., in press). Inequality constraints are implicitly used in the testing of commonly applied longitudinal structural equation models, like the autoregressive model and the latent growth curve model, although this is not commonly acknowledged. Such constraints are the result of a null hypothesis in which the parameter value(s) are on the boundary of the admissible parameter space. For instance, this occurs in testing whether the variance of a growth parameter is significantly different from zero. It will be shown that in these cases the asymptotic distribution of the likelihood ratio (i.e. a chi-square difference) cannot be treated as that of a central chi-square-distributed random variable with degrees of freedom equal to the number of constraints. The correct distribution for testing one or a few parameters at a time will be inferred, and I will describe the subsequent steps that one should take in order to obtain this distribution. Using the correct distribution may lead to appreciably greater statistical power.

    Stoel, R.D., Galindo-Garre, F., Dolan, C., & Van den Wittenboer, G. (in press). On the Likelihood Ratio test in structural equation modeling when parameters are subject to boundary constraints. To appear in Psychological Methods.
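
    As a rough illustration of the boundary problem described in this abstract, the sketch below compares the naive chi-square(1) p-value with the p-value from a 50:50 mixture of chi-square(0) and chi-square(1). That mixture is the commonly cited asymptotic reference distribution when exactly one variance is tested against zero and no other parameter lies on the boundary; treating it as the situation covered in the talk is an assumption. The function names and the example test statistic are hypothetical.

```python
import math


def chi2_sf_1df(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    if x <= 0:
        return 1.0
    # For 1 df, P(Z^2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))


def naive_p_value(lrt: float) -> float:
    """p-value when the boundary is ignored (plain chi-square with 1 df)."""
    return chi2_sf_1df(lrt)


def boundary_p_value(lrt: float) -> float:
    """p-value under a 50:50 mixture of chi-square(0) and chi-square(1).

    Assumption: a single variance parameter is tested against zero and
    nothing else is on the boundary of the parameter space.
    """
    if lrt <= 0:
        return 1.0
    return 0.5 * chi2_sf_1df(lrt)


# Hypothetical likelihood ratio statistic (chi-square difference).
lrt = 3.0
print(f"naive p = {naive_p_value(lrt):.4f}")
print(f"boundary-corrected p = {boundary_p_value(lrt):.4f}")
```

    Under these assumptions the corrected p-value is half the naive one, which is one way to see where the gain in power mentioned in the abstract can come from.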

    Reinoud Stoel is Assistant Professor of Methodology and Statistics at the Department of Educational Sciences, University of Amsterdam. His PhD thesis (2003) is titled “Issues in growth curve modeling”.

    Ellen Hamaker (University of Utrecht)

    Time series analysis in psychological research

    Time series analysis (TSA) is a class of techniques that allows us to determine the lawfulness underlying the variability within a system over time. In psychological research "system" can refer to a set of variables observed in a single individual, but also to variables observed in a dyad (e.g., mother and child, client and therapist, spouses), or even a family.

    Although TSA has been recognized as a potentially powerful tool for studying psychological processes at their core (i.e., within or between individuals), the applications in psychology are sparse and often of a metaphoric nature. There are several reasons for this lack of quantitative applications. First, most psychological researchers are unfamiliar with these techniques as TSA is not taught as one of the basic tools in social sciences. Second, there are many diverse techniques developed in different areas, which makes it difficult to decide what technique is needed and to what extent the techniques overlap. Third, the software needed to implement these techniques is rarely part of the standard software packages used in the social sciences.

    I discuss several empirical applications of TSA in psychology which illustrate the strength of TSA as a means to study psychological processes. From these examples it becomes clear that TSA allows us to test hypotheses that cannot be investigated using our standard nomothetic techniques. In addition, I discuss the possibility of using idiographic results obtained with TSA to build on nomothetic knowledge.

    Ellen Hamaker is Assistant Professor of Methodology and Statistics at the University of Utrecht. After obtaining her PhD degree (2004, “Time series analysis and the individual as the unit of psychological research”) she worked as a postdoc in Quantitative Psychology at the University of Virginia for a year. In 2005, she obtained a Veni grant from NWO.

    Cees H. Elzinga (Free University)

    Metric representations of categorical time series

    This contribution considers sequence analysis as the problem of constructing metric representations of categorical time series. It is argued that the fundamental problem of sequence analysis is to construct metrics and similarity measures pertaining to attributes of pairs of sequences, such that these attributes are meaningful within the context of substantive social science theories. Four classes of alternatives to Optimal Matching are presented, algorithms are provided and pertaining geometries are discussed. These alternatives preserve attributes that are meaningful and relevant in a wide diversity of substantive applications. The classes of models are extended, and algorithms adapted, to handle duration in two different ways. Examples and illustrations will be taken from demography.

    Cees Elzinga obtained his Ph.D. in psychology at Radboud University Nijmegen. He published on color vision models and measurement theory during the eighties in Vision Research, Journal of Mathematical Psychology and the Journal of the Optical Society of America. From 1985 on, he held sales and general management positions in various international companies in quite different markets and returned to academia in 2002 as an assistant professor in the Dept. of Social Research Methods of the Faculty of Social Sciences of the Vrije Universiteit in Amsterdam. Since then, he published on representations of categorical time series and on latent Markov chains in Journal of Classification and in Sociological Methods & Research.

    Mark de Rooij (Leiden University)

    Gravitational models for the analysis of change

    Newton's law of gravity states that the force between two objects in the universe is proportional to the product of the masses of the two objects divided by the square of the distance between them. It will be shown that this law applies very well to the analysis of longitudinal categorical data, where the number of people changing their behavior/choice from one category to another is a measure of force and the goal is to obtain estimates of mass for the two categories and an estimate of the distance between them. In order to provide a better description of the data, dynamic masses and dynamic positions will be introduced. After laying out the basic idea, relationships with other models, identification issues, generalizations to three time points and some problems will be discussed.
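
    To make the gravity analogy concrete, the sketch below computes the "force" between categories from given masses and positions, i.e. the quantity that the number of people moving between two categories is supposed to reflect. The category labels, masses and one-dimensional positions are made-up values, and the proportionality constant is set to 1; none of this is taken from the talk.

```python
from itertools import permutations


def gravity_flow(mass_from: float, mass_to: float, distance: float) -> float:
    """Product of the two masses divided by the squared distance."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return mass_from * mass_to / distance ** 2


# Hypothetical category masses and positions on a single dimension.
masses = {"A": 2.0, "B": 3.0, "C": 1.0}
positions = {"A": 0.0, "B": 1.0, "C": 3.0}

# Model-implied flow for every ordered pair of distinct categories.
expected_flows = {
    (a, b): gravity_flow(masses[a], masses[b], abs(positions[a] - positions[b]))
    for a, b in permutations(masses, 2)
}

for (a, b), flow in sorted(expected_flows.items()):
    print(f"{a} -> {b}: {flow:.3f}")
```

    This static version gives the same value in both directions for each pair of categories; the dynamic masses and positions mentioned in the abstract are not modelled here.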

    Mark de Rooij is Assistant Professor at the Department of Methodology and Statistics for Psychological Research. The subject of his research is longitudinal categorical data analysis, in particular visualisation techniques for such data based on the multidimensional scaling family. For more information see www.leidenuniv.nl/fsw/mderooij.

    Paul van Geert (University of Groningen)

    Dynamic systems approaches to long-term change and development: principles, models and analysis of data

    A dynamic system can be defined as ‘a means of describing how one state develops into another state over the course of time’ (Weisstein, 1999). Such a dynamic system must be studied as a process over time occurring in an “individual” (i.e. the unit of analysis, which includes individual persons, but also groups, as in group dynamics). The system can be described in the form of a characteristic equation, namely $y_{t+1} = f(y_t)$. If recursively applied, the system results in the description of a time evolution, namely a series of time steps at $t$, $t+1$, $t+2$ and so forth (in the limit also applying to infinitesimal steps). The function $f$ specifies the evolution rule or dynamic rule, as it applies to a particular “individual”.
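
    The recursive application of the characteristic equation can be sketched in a few lines. The evolution rule used here, a discrete logistic growth step, is only a hypothetical choice of f for illustration, as are the parameter values and the starting state.

```python
from typing import Callable, List


def iterate(f: Callable[[float], float], y0: float, steps: int) -> List[float]:
    """Apply y[t+1] = f(y[t]) repeatedly and return the whole trajectory."""
    trajectory = [y0]
    for _ in range(steps):
        trajectory.append(f(trajectory[-1]))
    return trajectory


def logistic_growth(y: float, rate: float = 0.5, capacity: float = 1.0) -> float:
    """Hypothetical evolution rule: growth that slows down near a capacity."""
    return y + rate * y * (1.0 - y / capacity)


# Start at a low level and follow the system for 20 time steps.
trajectory = iterate(logistic_growth, 0.1, 20)
print([round(y, 3) for y in trajectory])
```

    With these values the state rises quickly at first and then levels off near the capacity; a different f would give a different time evolution from the same starting state.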

    The goal of a dynamic systems approach to long-term change and development is to specify the form of f, given a set of available state data and a theory or conceptual frame suggesting an explanation for how one state develops into another one. In principle, the data should preferably be time-serial, but in some cases also cross-sectional data can provide qualitative indicators (“flags”) for some underlying dynamics. In the context of human development or long-term behavioral change, as in clinical-psychological intervention, the study of dynamic systems poses particular problems. These problems are related to the complexity of the phenomena under study on the one hand and to the problems of data collection on the other hand. Data collection problems refer, among others, to the nature of the data. First, these data are the result of short-term dynamics, for instance the dynamics of test-administration, or the person-context dynamics during an observational study. Second, these data are often difficult to collect, and processes must thus be reconstructed on the basis of relatively small time-series, for instance.

    Although I believe that, one day or another, social and behavioral scientists will have to complement their (multi-)variate approaches with a dynamic systems approach in order to attain satisfactory descriptive and explanatory adequacy, the problems with the application of a dynamic systems approach to development and long-term change are still gigantic. I will present some examples of research carried out in the Groningen Developmental Research Program in an attempt to demonstrate how the dynamic systems approach may be applied to the peculiar kind of data that is characteristic of many areas of developmental and change studies.

    Paul van Geert (1950) studied Psychology and Educational Sciences at the University of Ghent, Belgium, with a dissertation on "Language development in the light of cognition and perception" (1975). In 1976 he became lecturer at the University of Groningen and was appointed professor of developmental psychology in 1985. In 1978-1979 he was a fellow at NIAS, in a project group consisting, among others, of J. Bruner, M. Bowerman and D. Olson. In 1992-1993, he was awarded a fellowship at the Center for Advanced Studies in the Behavioral Sciences, Stanford, California. He has been visiting professor at the Universities of Paris and Turin and has a long-standing close collaboration with the Harvard Graduate School in Educational Sciences. In 1998-2000, he acted as scientific coordinator, with Jacques Lautrey and Bernard Mazoyer, of the Program on Invariants and Variability in the Cognitive Sciences, as part of the Action Concertée Cognitique, financed by the French Ministry of Scientific Research. Since 2002 he has been a lecturer at the Summer Institute on Mind, Brain and Education at Harvard University. Paul van Geert’s major interest lies in the further elaboration of dynamic systems models of development in general. His current work encompasses three complementary approaches to the application of dynamic systems to development, namely mathematical model building, empirical research on developmental processes and the development of statistical simulation methods. Van Geert’s empirical research is focused on longitudinal studies in childhood and covers various fields: cognitive change, language and social and socio-cognitive development, with a special focus on (intra-)individual variability and its relation with the underlying dynamics of development. 
    Although this research is primarily of a fundamental nature, there is a recent shift towards applied research settings, namely educational contexts, where the theoretical problems of understanding the dynamics are directly related to the practical and applied problems of managing and guiding these dynamics.

    Spring meeting 2006

    The upcoming spring meeting on Friday 28 April has measurement as its theme. The meeting will be held at Erasmus Universiteit Rotterdam, on the Woudestein campus. Various aspects of measurement will be addressed during the day. This time there are two keynote speakers: Wim van der Linden will speak about adaptive testing, and the current IFCS president David Hand will speak about measurement and quantification in different disciplines. The other speakers are Paul Krabbe, Christiaan Heij, Gert Jacobusse and Marike Polak. It promises to be a great day.

    10.00 COFFEE
    10.30 Wim van der Linden Statistical Issues in Adaptive Testing
    11.30 Paul Krabbe Scaling Models to Quantify Health States
    12.05 LUNCH
    13.15 Christiaan Heij Macro Economic Forecasting with Many Predictors
    13.50 Gert Jacobusse A scale for measuring development of children aged 0-2 years
    14.25 Marike Polak Testing single-peakedness of item responses
    15.00 TEA
    15.30 David Hand Size Matters
    16.30 Members' meeting
    16.45 DRINKS

    Location: Erasmus Universiteit, Rotterdam: Woudestein, Building C, room B4 (map). See also http://www.eur.nl/eur/bereiken


    Admission to the meeting is free. Lunch costs 10 euros, to be paid on site. You can register until Tuesday 25 April with Marieke Timmerman (m.e.timmerman@rug.nl). If you do NOT wish to have lunch, could you please indicate this when registering?

    ABSTRACTS

    Wim J. van der Linden (University of Twente)

    Statistical Issues in Adaptive Testing.

    It has been known since the first standardized intelligence test (Binet & Simon, 1905) that the accuracy of test scores can be improved greatly by adapting the test items to the test taker’s responses. But we had to wait until the development of realistic probabilistic response models in psychometrics before these intuitive notions of adaptation could be formalized. The first large-scale applications of adaptive testing were introduced in the early 1990s when computers became both affordable and powerful enough for real-time parameter estimation. These applications led to numerous new research topics, such as adaptive testing with large numbers of content constraints, statistical protection of test-item security, control of the speediness of the test, adaptation with respect to multidimensional abilities, optimal sampling for item calibration, equating of scores on adaptive tests and linear reference tests, use of response times on the items, modeling of “item cloning”, item-pool design, and the detection of aberrant response behavior. We discuss several of these topics, drawing freely on recent developments in Bayesian analysis, hierarchical modeling, and combinatorial optimization.

    Wim J van der Linden is Professor of Measurement and Data Analysis, Faculty of Behavioral Sciences, University of Twente, The Netherlands. His research interests include test theory, computerized adaptive testing, optimal test design, test equating, and response-time modeling. He is editor of Handbook of Modern Item Response Theory (New York: Springer, 1997; with R. K. Hambleton), and Computerized Adaptive Testing: Theory and Applications (Boston: Kluwer, 2000; with C. A. W. Glas), author of Linear Models for Optimal Test Design (New York: Springer, 2005), and currently works on Introduction to Test Theory and its Applications and Elements of Adaptive Testing (with C. A. W. Glas), also to be published by Springer. He has served on the editorial boards of several international journals, has been a member of boards and committees of numerous national and international testing organizations, and is editor for the Springer series Statistics for Social and Behavioral Sciences. He is also a former President of the Psychometric Society and a Fellow of the Center for Advanced Study in the Behavioral Sciences, Stanford, CA.

    Paul F.M. Krabbe (Dept. Medical Technology Assessment, Radboud University Nijmegen)

    Scaling Models to Quantify Health States

    The goal of all health care services activities and programs is to improve or sustain the health of people. Thus, it is not surprising that, over the years, there has been considerable interest and activity in developing methodologies to measure quantitatively the health state of patients and populations.

    So far, the methodology for doing this has been dominated by theories and measurement techniques from (health) economics. Apart from the fact that the economic measurement techniques are not very practical to conduct, numerous empirical studies have shown that these techniques are affected by several biases and axiomatic violations.

    Scaling models developed by psychometricians and others are based on a combination of simple measurement tasks and specific data analysis. The attractiveness of the use of scaling models (e.g., Thurstone scaling, Rasch model) in the case of quantifying health states is based on the uncomplicated and cognitively simple judgment tasks (ranking, choices) that guarantee response data of good quality. These data may provide enough information, after additional analytical computations, to arrive at quantitative measures for health states at a group level.

    Although scaling models have been applied with considerable success in research areas such as educational measurement and environmental evaluation, they have hardly been explored or applied in the field of medicine. Scaling models that are in theory suitable for quantifying health states will be briefly discussed. In the next section we will present first results from our own study.

    Dr. Krabbe (psychologist/methodologist) works at the department of Health Technology Assessment at the Radboud University Nijmegen Medical Centre. His main scientific interest is in outcome measurement methodology and evaluation research in the field of medicine. He is an expert in ‘health-related quality of life’ and the ‘quality-adjusted life years’ (QALY) model.

    His research activities focus on quantification of health states and measurement methodology. Areas of recent research include biases related to the visual analogue scale, the use of singular value decomposition to reveal valuation structures, and the validity of the concept of responsiveness (or sensitivity) as a distinct measurement property. During his 2004 visit to the Harvard Initiative for Global Health he explored the suitability of scaling models to quantify health states.

    Dr. Krabbe’s methodological work has appeared in Medical Decision Making, Journal of Quality of Life Research, Health Economics, Journal of Clinical Epidemiology, Social Science & Medicine, and Medical Care.

    Christiaan Heij, Dick J. van Dijk, Patrick J.F. Groenen (Econometric Institute, Erasmus University Rotterdam)

    Macro Economic Forecasting with Many Predictors

    Economic decisions are often based on future expectations, for instance, on price movements and production developments. Therefore it is of interest to predict such key economic variables to support decision making, with time horizons ranging from several months to a few years. A simple forecast method is to use autoregressive (AR) models, which use only lagged values of the predicted variable as predictors. Another option is to employ predictors that are suggested by economic theory, for instance, employment to predict inflation (the so-called Phillips curve). As the number of potentially relevant predictor variables may be large, there is an increasing interest in purely data-driven methods that combine the forecast information contained in all predictor variables. Examples of such methods are forecast combination methods and principal component regression, see [3, 4]. In this presentation, we will consider alternative methods to construct principal components and we will discuss the method of principal covariate regression, see [1]. Selection of the forecast model involves both the choice of lag structure (by means of BIC or cross validation) and the estimation of the parameters of the forecast equation. The methods are evaluated by means of their out-of-sample forecast quality for a set of key macro economic and financial variables relating to inflation and production in the USA, see [2].

    [1] De Jong, S., and H.A.L. Kiers (1992), Principal covariate regression, Chemometrics and Intelligent Laboratory Systems 14, pp. 155-164.
    [2] Heij, C., D.J. van Dijk and P.J.F. Groenen, Improved construction of diffusion indexes for macroeconomic forecasting, Econometric Institute Report EI 2006-03, 2006.
    [3] Stock, J.H., and M.W. Watson (1999), Forecasting inflation, Journal of Monetary Economics 44, pp. 293-335.
    [4] Stock, J.H., and M.W. Watson (2002a), Forecasting using principal components from a large number of predictors, Journal of the American Statistical Association 97, pp. 1167-1179.
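
    The simplest benchmark named in this abstract, an autoregressive forecast that uses only lagged values of the predicted variable, can be sketched as follows. This is a minimal AR(1) fitted by ordinary least squares on a made-up series; the lag selection by BIC or cross validation and the principal component and principal covariate methods discussed in the talk are not shown.

```python
from typing import List, Tuple


def fit_ar1(series: List[float]) -> Tuple[float, float]:
    """Least-squares fit of y[t+1] = intercept + slope * y[t]."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def forecast(series: List[float], horizon: int) -> List[float]:
    """Iterate the fitted AR(1) equation forward from the last observation."""
    intercept, slope = fit_ar1(series)
    predictions = []
    last = series[-1]
    for _ in range(horizon):
        last = intercept + slope * last
        predictions.append(last)
    return predictions


# Made-up series that follows y[t+1] = 1 + 0.5 * y[t] exactly.
series = [0.0, 1.0, 1.5, 1.75, 1.875, 1.9375]
print(fit_ar1(series))
print(forecast(series, horizon=3))
```

    Forecasts from such a model depend only on the variable's own past, which is why it serves as a baseline for methods that pool information from many predictors.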

    Christiaan Heij is currently assistant professor in econometrics and statistics at the Econometric Institute of the Erasmus University Rotterdam. His PhD thesis ‘From Data to Model’ (Groningen, 1988) was in the areas of econometrics and mathematical systems theory, and his current research interests are in the areas of applied statistics and (macro) econometrics. He has published in a variety of journals and is (co-)author of four books. Apart from his research, he finds much pleasure in teaching econometrics and statistics.

    Gert Jacobusse (TNO Quality of Life)

    A scale for measuring development of children aged 0-2 years

    Development of young children is often measured using qualitative developmental markers. Sometimes age specific standardized scores are used, but these fail to have a common metric that allows comparison of developmental scores across age. We developed a quantitative developmental score (D-score) with improved measurement characteristics.

    The basic assumption of the D-score is the existence of a common continuous scale for the development of young children. Application of the Rasch Model to data on the “Van Wiechen scheme” resulted in excellent reliability and satisfactory fit. This indicates that the new quantitative D-score succeeds in representing outcomes of the instrument on a common interval scale.

    The definition of the D-scores is not specific to age, so the D-score can be used to monitor development and evaluate developmental velocity on the individual level.
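
    For readers unfamiliar with the Rasch model mentioned in this abstract, the sketch below shows its standard response function: the probability of passing an item depends only on the difference between a person's position on the scale and the item's difficulty. The milestone names, difficulties and the child's score are invented for illustration and are not values from the Van Wiechen scheme or the D-score study.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of passing an item under the Rasch model (logit scale)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical developmental milestones with made-up difficulties.
milestones = {"smiles": -2.0, "sits": 0.0, "walks": 2.0}

# Hypothetical position of one child on the same scale.
child_score = 0.5

for name, difficulty in milestones.items():
    p = rasch_probability(child_score, difficulty)
    print(f"{name}: {p:.2f}")
```

    Because persons and items share one scale, the same score can be interpreted at any age, which is the property the abstract relies on.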

    Gert Jacobusse graduated in psychology (Leiden University) in 2003, and now works as a statistician at TNO Quality of Life. He is involved in statistical applications in the field of prevention and health. His own research focuses on Item Response Theory and Computerized Adaptive Testing for measuring health outcomes.

    Marike Polak, Willem J. Heiser, Mark De Rooij (Department of Psychology, Leiden University)

    Testing single-peakedness of item responses

    Several researchers have developed models for single-peaked data. However, for practical researchers it is often difficult to decide whether or not their data are single-peaked.

    We developed a method for item analysis of single-peaked items based on the criterion of irrelevance by Thurstone and Chave (1929). This is a graphical method for evaluating the “relevance” of a given dichotomous attitude item a, where scale values of all items are plotted against the conditional probability of endorsing another item given that a subject endorses item a. The more the diagram shows a peaked pattern with the peak located at the scale value of item a, the more “relevant” item a is.

    We generalized this method to polytomous items and quantified the “relevance” by fitting a normal curve. The resulting goodness of fit was used as a test for single-peakedness. Furthermore, a measure of fit for the scale as a whole is suggested.

    The properties of this method were explored using data generated with a well-developed unfolding model called GGUM (Roberts et al., 2000). We varied sample size, number of items, item discrimination, and category thresholds. Evidence is presented that shows this method distinguishes single-peaked items from monotonic or cumulative items.

    Marike Polak is a PhD student and teacher at the Department of Research Methodology and Statistics for Psychology at Leiden University. Her research project concerns item analysis of single-peaked item response data. The aim is to contribute to the development of an alternative approach to item analysis based on classical test theory, which handles monotonic data. Her interest lies in the application of various data analysis techniques to social scientific data.

    David Hand (Department of Mathematics, Imperial College London, U.K.)

    Size matters

    The ideas of measurement are so ubiquitous that we often fail to notice them: they are simply parts of the conceptual universe in which we function. However, it has not always been thus and sometimes, even now, rips in this usually unnoticed background fabric appear, casting doubts on one's view of the way the world works. Occasionally these tears have serious, even fatal consequences. This talk looks at the conceptual infrastructure of quantification, showing how humans have constructed it, how it can be interpreted, and how it is manipulated to make valid inferences about the real world. The talk is illustrated with measurement tools from psychology, medicine, physics, economics and other areas.

    David Hand is Professor of Statistics at Imperial College London, where he is Head of the Statistics Section in the Mathematics Department and also Head of the Mathematics in Banking and Finance programme in the Institute for Mathematical Sciences. He is the President of the International Federation of Classification Societies for 2006-7. He has published twenty-three books on statistics and related areas and has particular interests in classification, data mining, the foundations of statistics, and applications in finance and medicine.

    Autumn meeting 2005

    The upcoming autumn meeting will be held in Zeist at TNO Kwaliteit van Leven. A strong programme is planned, with two renowned foreign speakers. Michael Greenacre (UPF, Barcelona), one of the world's experts on correspondence analysis, will speak about some controversies surrounding this technique. The second foreign speaker, John Gower (Open University, Milton Keynes), will, together with Garmt Dijksterhuis, discuss Procrustes analysis, on which they recently published a book. The rest of the autumn meeting is devoted to a practical comparison of different classification techniques, with the aim of distinguishing diseased patients from healthy persons.

    10.00 COFFEE
    10.30 John Gower Procrustes problems - An overview
    11.15 Garmt Dijksterhuis An application of Generalised Procrustes Analysis as a method to compare data sets collected by different methods
    12.00 Age Smilde, Chris de Koster Proteomics based clinical biomarkers: how to distinguish healthy from diseased? Introduction of the shoot-out
    12.30 LUNCH
    13.30 The shoot-out continues
    13.30 Wies Akkermans Support vector machines
    13.45 Carina Rubingh Principal discriminant variates
    14.15 Paul Eilers Penalized logistic regression
    14.30 Theo Reijmers Nearest shrunken centroids
    14.45 Margriet Hendriks Logit boost
    15.00 Discussion
    15.25 TEA
    15.45 Michael Greenacre Tying up the loose ends of (simple) correspondence analysis
    16.45 DRINKS

    Location: TNO Quality of Life, Zeist (route html, route pdf)

    Attendance is free and open to anyone interested, but registration is mandatory (via email to Marieke Timmerman before Thursday November 10). The lunch is to be paid for in cash at the meeting. When registering, please indicate whether you would like to join the lunch.

    ABSTRACTS

    John C. Gower (The Open University, Milton Keynes, U.K.)

    Procrustes Problems - An overview

    The basic two-sets Procrustes problem is to match given matrices X_1, X_2 via a transformation X_1 T, where T is constrained in some specified way and has to be estimated. The matching may be by least-squares, min ||X_1 T - X_2||^2, or by maximising the inner-product trace(X_2' X_1 T), or by several other criteria. Typical constraints on T are that T = Q (Orthogonal), T = P (Projection), T = C (Direction Cosines) but there are other possibilities. Variations include “two-sided Procrustes”, min ||X_1 T_1 - X_2 T_2||^2, and “Double Procrustes”, min ||T_2 X_1 T_1 - X_2||^2. Then, we may add isotropic scaling, min ||s X_1 T - X_2||^2, or anisotropic scaling, min ||S X_1 T - X_2||^2, where S is an unknown diagonal scaling matrix which may appear in other positions too; alternatively S may be replaced by known weights, not necessarily diagonal. Rather than two matrices, we may have Generalised Procrustes Analysis, where K matrices X_1, X_2, X_3, ..., X_K are to be matched simultaneously. This may be regarded as a three-mode problem and has relationships with other three-mode Individual Scaling models and with Generalised Canonical Correlation. I shall attempt to thread my way through this minefield; Garmt Dijksterhuis will discuss applications.
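
    As an illustration of the simplest case named in the abstract (least-squares matching with T = Q orthogonal), the following is a minimal sketch, not taken from the talk. It assumes NumPy and uses the standard singular value decomposition solution; the function name orthogonal_procrustes and the simulated data are hypothetical.

```python
import numpy as np


def orthogonal_procrustes(X1, X2):
    """Return the orthogonal Q that minimises ||X1 Q - X2||^2 (Frobenius norm).

    Standard result: if X1' X2 = U S V' is a singular value decomposition,
    the minimiser is Q = U V'. The same Q maximises trace(X2' X1 Q).
    """
    U, _, Vt = np.linalg.svd(X1.T @ X2)
    return U @ Vt


# Simulated example (assumption): X2 is an exactly rotated/reflected copy of X1.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(10, 3))
Q_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # a random orthogonal matrix
X2 = X1 @ Q_true

Q_hat = orthogonal_procrustes(X1, X2)
residual = np.linalg.norm(X1 @ Q_hat - X2)  # close to zero for noise-free data
```

    With noisy data the residual would no longer vanish, and the other constraints listed in the abstract (projection, direction cosines, scaling) need different algorithms that this sketch does not cover.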

    Professor John Gower graduated in Mathematical Statistics (with distinction) at the University of Manchester. He worked in applied multivariate analysis, particularly on classification problems and graphical methods for exposing structure in data involving observations on many variables. In the course of this work he developed several methods that are now widely used, including contributions to measures of similarity, classification methods, metric multidimensional scaling, Procrustes analysis, the analysis of asymmetry and, more recently, developing a unified theory of biplots. He gained individual merit promotion in 1970 (equivalent to a personal chair in a university) and in 1984 became head of the Biomathematics Division, which included the Statistics and Computer Departments. Since retirement from Rothamsted in March 1990 at the mandatory age of 60 he has held several visiting appointments, notably in the Department of Data Theory of the University of Leiden (1991-1993) and, at the Universities of Dortmund and Salamanca. In 1994, he joined the Statistics Department of the Open University and in 1997 was awarded the title of Professor. He has nearly 170 publications, including the first monograph on biplots - Gower, J. C. and Hand, D. J. (1996) Biplots, (Monographs on Statistics and Applied Probability, London: Chapman and Hall (277 pages)) and a monograph, with G.B. Dijksterhuis on Procrustes Problems 2004, Oxford University Press (247 pages).


    Garmt Dijksterhuis (Research institute Agrotechnology and Food Innovations, Wageningen University and Faculty of Economics, University of Groningen)

    An application of Generalised Procrustes Analysis as a method to compare data sets collected by different methods

    A group of 207 subjects scored a set of associations to logos using brand personality items. This was done with the same set of 13 logos under the instructions that the logos belong to a particular product category. The exercise was repeated for four different product categories. In addition a set of 20 subjects sorted the logos into a number of groups, under no instruction at all, other than to freely group the logos. A Procrustes matching of the configurations of associations for the four product categories showed similar configurations of logos, so a group average configuration is representative for each of the four configurations. This group average is subsequently matched to the MDS configuration based on the free grouping. The match shows two significantly different configurations. We conjecture that the free grouping task taps a different process than the association scoring task. In the latter the subjects are guided by the meaning of the association items, in the former no interpretation is needed. The free grouping shows a more ‘pure’ perceptional result than the association task which always includes interpretation of verbal labels.

    Garmt Dijksterhuis is a psychologist and methodologist. He studied theoretical and experimental psychology and psychology of perception at the University of Utrecht and wrote his Ph.D. dissertation at the department of Data Theory at the University of Leiden, in the Netherlands. Garmt has written or co-authored over a hundred publications in sensory and consumer science, statistics and psychology. He is one of the founders of the sensometric society (www.sensometric.org) and is its current chair, a member of the editorial board of the journal Food Quality and Preference and chair of the sensory science branch of the Dutch marketing research association (MOA.nl). Garmt taught courses in consumer and sensory science and methodology and related topics and has been a guest scientist at several universities and research institutes and an invited lecturer at many occasions. Currently he is employed as a senior scientist at the department Consumer and Market Insight of the research institute Agrotechnology and Food Innovations (Wageningen University and Research Centre), and as an associate professor at the Marketing department of the Faculty of Economics at the University of Groningen. His main research interests are the psychology of perception and appreciation, and in particular the impact of the emotion-cognition controversy on choice behaviour and on research methodology.



    Age Smilde (Biosystems Data Analysis, Swammerdam Institute for Life Sciences, University of Amsterdam)

    Carina Rubingh (Analytical Sciences, TNO Quality of Life)

    Theo Reijmers (Groningen Bioinformatics Centre, University of Groningen; Analytical Biosciences, Leiden/Amsterdam Centre for Drug Research, Leiden University)

    Wies Akkermans (Biometris, Wageningen University and Research Centre)

    Paul Eilers (Department of Medical Statistics, Leiden University Medical Centre)

    Huub Hoefsloot (Biosystems Data Analysis, Swammerdam Institute for Life Sciences, University of Amsterdam)

    Suzanne Smit (Biosystems Data Analysis, Swammerdam Institute for Life Sciences, University of Amsterdam)

    Margriet Hendriks (Metabolomics Centre, Academic Biomedical Centre, University of Utrecht)

    Chris de Koster (Mass Spectrometry of Biomacromolecules, Swammerdam Institute for Life Sciences, University of Amsterdam)

    Hans Aerts (Biochemistry, Amsterdam Medical Centre)

    Proteomics based clinical biomarkers: how to distinguish healthy from diseased?

    Proteomics is a new genomics technique regarding the measurement of proteins in different samples, such as body fluids, tissue, cells, etc. One of the applications of proteomics is in obtaining insight into the development of diseases and in diagnosing diseases and their severity at the protein level. In our study a cohort of healthy persons is used as a control, and these are compared with patients with Gaucher’s disease. Blood samples are available for both groups. These are measured with Surface Enhanced Laser Desorption Ionization Mass Spectrometry (SELDI-MS), a relatively new way of performing a proteomics measurement. Hence, the problem comes down to discriminating control versus diseased persons on the basis of their SELDI-MS spectra.

    These SELDI-MS spectra, however, generate an abundance of data: very many variables are measured for a single sample. This poses challenges to the subsequent data analysis. These challenges are: i) how to avoid overfitting, ii) which discrimination method to use, iii) how to perform variable selection, iv) how to assess the quality of the model and discrimination rule.

    To answer some of the questions mentioned above we organized a ‘shoot-out’: every participant used his/her own favorite method on the same data set. A protocol was developed regarding the setup of the calculations (e.g. how to do the validation) in order to make the final results comparable. The methods used include: Nearest Shrunken Centroids, Principal Components Discriminant Analysis, Principal Discriminant Variates, LogitBoost, Penalized Logistic Regression and Support Vector Machines. All these methods were used and evaluated according to the previously developed protocol.

    In a series of presentations, the results of this ‘shoot-out’ will be presented. First, the background of Gaucher’s disease and of SELDI-MS will be briefly sketched, followed by the setup of the comparison. Then, in short presentations, each team member will present his/her method and the result. The series will be closed by an overall presentation of the results. Differences and agreements will be discussed.



    Michael Greenacre (Departament d’Economia i Empresa, Universitat Pompeu Fabra, Barcelona)

    Tying up the loose ends of (simple) correspondence analysis

    Although correspondence analysis is now widely available in computer software packages and applied in a variety of contexts, notably the social and environmental sciences, there are still some misconceptions about this method as well as unresolved issues which remain controversial to this day. In this seminar we hope to settle several of these matters, namely (i) the way CA measures variance in a two-way table, (ii) the influence, or rather lack of influence, of outliers in CA maps, (iii) the issue of the scaling of maps and their interpretation, and (iv) whether or not to rotate the CA solution. Two examples are used as illustrations of the theory, one from linguistics and the other from marine biology.

    Michael Greenacre is Professor of Statistics at the Universitat Pompeu Fabra, Barcelona. His main research interests are in multivariate data analysis in the social and environmental sciences, especially the analysis of categorical data. He has published two books on correspondence analysis and co-edited three volumes on the analysis and visualization of categorical data. Apart from his university teaching he has taught short courses internationally, on correspondence analysis in the social sciences and multivariate analysis for environmental biologists.



    Spring meeting 2005: Support vector machines

    The upcoming spring meeting will be devoted to Support Vector Machines (SVM), a new class of methods for classifying information. Within the VOC, SVMs are still applied only sporadically. The spring meeting aims to make the ins and outs of SVMs, and their applications across the country, more widely known.

    Date: Friday 15 April 2005

    Location: TNO Kwaliteit van Leven (TNO Quality of Life), Gorter building, Wassenaarseweg 56, 2333 AL Leiden (route)

    10.00-10.30 Welcome and coffee
    10.30-11.35 Johan Suykens (K.U. Leuven)
    11.35-12.25 David Tax (Delft University of Technology)
    12.25-13.30 Lunch
    13.30-14.05 Elena Marchiori (Free University Amsterdam)
    14.05-14.40 Georgi Nalbantov (Erasmus University Rotterdam)
    14.40-15.00 Tea
    15.00-15.35 Cajo ter Braak (Plant Research International Wageningen)
    15.35-16.00 Members' meeting
    16.00 Drinks

    Please register via Marieke Timmerman (free of charge). Further information is available from Stef van Buuren.

    ABSTRACTS


    Johan Suykens (K.U. Leuven, Belgium)

    Support Vector Machines and Kernel Based Learning: An Introduction

    The use of kernel based learning techniques has become increasingly popular, largely stimulated by the work on support vector machines as originally introduced within the context of statistical learning theory. Standard support vector machines (SVM) for classification and regression lead to solving convex optimization problems. The problem formulation and solution are characterized by a primal and a dual problem, where multilayer neural network interpretations can be given in both worlds. In contrast with many classical models, support vector machines are able to learn and generalize in very high-dimensional input spaces. The method makes use of a high dimensional feature map (which need not be known explicitly) in relation to a Mercer kernel (the so-called kernel trick). Notions such as large margin classification and regularization techniques play an important role at this point. The method can also be viewed as a functional optimization problem in reproducing kernel Hilbert spaces (RKHS).

    The kernel trick has been further applied to a wider variety of problems, such as kernel Fisher discriminant analysis (KFDA), kernel principal component analysis (KPCA), kernel canonical correlation analysis (KCCA), kernel partial least squares (KPLS), kernel clustering and others. Optimization formulations with primal and dual problems have been given with least-squares support vector machines (LS-SVM), thereby extending support vector machine methodology to a wider range of problems for regression, classification, supervised and unsupervised learning, recurrent networks and control. One often has the choice to solve either the primal or the dual problem depending on the dimensionality of the feature map (or its approximation) and to select the most suitable representation for the problem at hand (e.g. high dimensional input space or large data sets). Kernels have also been customized towards specific application areas such as text mining, bioinformatics or in relation to graphical models. Novel techniques of hierarchical kernel machines even make it possible to find the model together with the tuning parameters by solving a convex problem. At the same time one can achieve in this way e.g. sparse representations, stability of learning machines and input variable selection.

    In this talk we outline the main concepts of support vector machines and kernel based learning and show successful real-life applications being studied at KU Leuven including time-series prediction, nonlinear modelling, classification of NMR data and microarray data analysis.
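
    The kernel trick mentioned in the abstract above can be illustrated with a small numerical check. This is a sketch under stated assumptions, not material from the talk: it uses NumPy, a homogeneous polynomial kernel of degree 2 on two-dimensional inputs, and hypothetical function names phi and poly_kernel. It shows that evaluating the kernel in the input space gives the same value as an inner product in the explicit feature space, so the feature map never has to be computed.

```python
import numpy as np


def phi(x):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel in 2-D."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])


def poly_kernel(x, z):
    """Kernel evaluated directly in the input space: k(x, z) = (x . z)^2."""
    return float(np.dot(x, z)) ** 2


x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = float(np.dot(phi(x), phi(z)))  # inner product in feature space
implicit = poly_kernel(x, z)              # same value without computing phi
```

    For kernels such as the Gaussian (RBF) kernel the corresponding feature space is infinite-dimensional, which is why working with the kernel alone is the practical route.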

    Johan A.K. Suykens is an associate professor with K. U. Leuven Belgium. His research interests are mainly in the areas of the theory and application of neural networks and nonlinear systems. He is author of the books "Artificial Neural Networks for Modelling and Control of Non-linear Systems" (Kluwer Academic Publishers) and "Least Squares Support Vector Machines" (World Scientific) and editor of the books "Nonlinear Modeling: Advanced Black-Box Techniques" (Kluwer Academic Publishers) and "Advances in Learning Theory: Methods, Models and Applications" (IOS Press). In 1998 he organized an International Workshop on Nonlinear Modelling with Time-series Prediction Competition. He is a Senior IEEE member and has served as associate editor for the IEEE Transactions on Circuits and Systems - Part I (1997-1999) and Part II (since 2004) and since 1998 he is serving as associate editor for the IEEE Transactions on Neural Networks. He received an IEEE Signal Processing Society 1999 Best Paper Senior Award and several Best Paper Awards at International Conferences. He is a recipient of the International Neural Networks Society INNS 2000 Young Investigator Award for significant contributions in the field of neural networks. He has served as Director and Organizer of a NATO Advanced Study Institute on Learning Theory and Practice (Leuven 2002) and as a program co-chair for the International Joint Conference on Neural Networks IJCNN 2004. Further information: http://www.esat.kuleuven.ac.be/sista/members/suykens.html and http://www.esat.kuleuven.ac.be/sista/lssvmlab/.


    David Tax (Delft University of Technology)

    Learning in high dimensional feature spaces

    In traditional statistical pattern recognition and machine learning one used to be very careful to sample the data distributions densely enough to reliably estimate the relevant probability density distributions. In current practice, it also happens that large numbers of features are measured for each object under investigation, for instance in the expression of genes in different types of tissue, or in the recordings of images of faces. In order to generalize at all in these high dimensional feature spaces, the complexity of the data model should be severely restricted. A model that was introduced recently, the support vector classifier, explicitly minimizes its complexity (in terms of the so-called VC dimension). In this talk I would like to discuss the background of this model, the basis for its success, the caveats in its application and some derivative models.

    David M.J. Tax is a postdoc at the Pattern Recognition group at the Delft University of Technology. His main research interest is one-class classification, and he is the author of the toolbox DD_Tools, but actually, he likes any learning method that does not include tons of 'magic' parameters. Current research projects include PRTools, one-class classification, geological Spectral Analysis, and the Detection of Lung Diseases in CT Scans. Further information: http://www-ict.ewi.tudelft.nl/~davidt/ and http://ict.ewi.tudelft.nl/index.php?option=com_contact&task=view&id=38.


    Elena Marchiori (Vrije Universiteit Amsterdam)

    Support Vector Machines for tumor diagnostic and biomarker detection with pattern proteomic data

    We describe methods based on support vector machines (SVMs) for tumor diagnosis and potential biomarker detection with pattern proteomic data. Applications to real-life datasets indicate that SVMs provide effective tools that are competitive with commercial software designed for this type of task.

    Elena Marchiori is assistant professor at the Department of Computer Science, Vrije Universiteit Amsterdam. Her research interests include machine learning techniques for learning and optimization, in particular evolutionary algorithms and support vector machines. Further information: http://www.cs.vu.nl/~elena/.


    Georgi Nalbantov (Erasmus University Rotterdam)

    Support Vector Regressions and their application in finance

    Support Vector Regressions (SVR) are a nonparametric tool for function estimation. In SVR the data from the original space are mapped into a higher-dimensional space, where a linear decision surface is found, which corresponds to a nonlinear one in the original space. SVR are known for their good generalization performance and the guarantee of a global solution that results from solving a convex optimization problem. Furthermore, robustness of the results is achieved via the employment of the so-called epsilon-insensitive loss function, instead of the common quadratic one. Here, we provide a brief intuitive introduction to SVR and apply them to a financial time series problem.
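
    The epsilon-insensitive loss mentioned in the abstract can be compared with the quadratic loss in a few lines. The following is a minimal sketch assuming NumPy; the function names, the tube width eps = 0.1 and the example residuals are hypothetical choices for illustration only.

```python
import numpy as np


def eps_insensitive(residual, eps=0.1):
    """Zero inside the tube |r| <= eps, growing linearly outside it."""
    return np.maximum(0.0, np.abs(residual) - eps)


def quadratic(residual):
    """Ordinary squared-error loss, for comparison."""
    return residual ** 2


# Hypothetical residuals: three small ones and two large outliers.
r = np.array([-2.0, -0.05, 0.0, 0.05, 2.0])

loss_eps = eps_insensitive(r)   # small residuals cost nothing, outliers cost |r| - eps
loss_quad = quadratic(r)        # outliers are penalised quadratically
```

    Because the penalty grows only linearly outside the tube, a single large residual has less influence on the fit than under the quadratic loss, which is the robustness property the abstract refers to.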

    Georgi Nalbantov is a Ph.D. candidate at the Erasmus Research Institute of Management and the Econometric Institute at the Erasmus University Rotterdam. He graduated from Maastricht University with a Master's degree in Economics in January 2004. His interests include Machine Learning and Finance, and in particular Support Vector Machines and financial Factor Models.


    Cajo ter Braak (Plant Research International, Wageningen)

    Support Vector Machines as Penalized Regression Method and how it is calculated

    Support Vector Machines fit nicely into the framework of penalized regression. This presentation will highlight this connection, and describe a general method for calculating the SVM solution based on this connection.
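
    One common way to make the connection with penalized regression concrete is to write the linear SVM as a hinge loss plus a ridge-type penalty on the weights. The sketch below assumes NumPy and fits that objective by plain subgradient descent; it is not the calculation method presented in the talk, and the function name, step size, penalty weight lam and toy data are all hypothetical.

```python
import numpy as np


def fit_linear_svm(X, y, lam=0.01, lr=0.1, n_iter=200):
    """Minimise mean(max(0, 1 - y * (X w + b))) + lam * ||w||^2 by subgradient descent."""
    n, p = X.shape
    w = np.zeros(p)
    b = 0.0
    for _ in range(n_iter):
        margins = y * (X @ w + b)
        active = margins < 1.0  # observations that contribute to the hinge loss
        grad_w = -(y[active, None] * X[active]).sum(axis=0) / n + 2.0 * lam * w
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy, linearly separable data (assumption for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

w, b = fit_linear_svm(X, y)
predictions = np.sign(X @ w + b)
```

    Replacing the hinge loss by a squared or logistic loss in the same objective gives ridge regression or penalized logistic regression, which is the sense in which the SVM sits in the same family.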

    Cajo J.F. ter Braak is professor of Multivariate statistics for life sciences at Biometris in Plant Research International, Wageningen. Ter Braak is the inventor of canonical correspondence analysis and author of the program Canoco, which is the de-facto standard for ordination of ecological data. Ter Braak publishes in diverse fields like statistical ecology, chemometrics, spatial statistics and statistical genetics. Further information: http://hcr3.isiknowledge.com/author.cgi?&link1=Browse&link2=Results&id=3249 and http://www.canoco.com.



    Anniversary meeting

    This year the VOC celebrates its 15th anniversary. To mark this third lustrum, a meeting will be organised that replaces the traditional autumn meeting. This anniversary meeting will be held on

    11 and 12 November 2004 in Driebergen.

    The theme of the meeting is 'everything in its place' (alles op zijn plaats), with which we want to draw attention to the concept of space in statistical analyses. We distinguish two areas:

    • The analysis of spatial data (for example the analysis of satellite images, or neighbourhood effects in crime)
    • The analysis of data with spatial models (the more abstract spaces as defined in multidimensional scaling techniques).

    The organising committee believes this will make for an interesting meeting that lies at the heart of the VOC. In short, a programme of the kind you have come to expect from the VOC: interesting and inspiring combinations of application and theory, in all kinds of fields. Examples are: linguistics, criminology, marketing, remote sensing, health and psychology. See also the leaflet for details and the programme.

    The costs are as follows:

    • VOC member: € 165,-
    • Non-member: € 195,-. Non-members automatically become trial members of the VOC for the year 2005.
    • Without overnight stay: € 60,- discount

    Our meeting will be held at the conference centre De Bergse Bossen in Driebergen (http://www.debergsebossen.nl/). Comfortable rooms, good food and drink, and time to linger in the corridors make this conference well worth attending socially as well as scientifically. De Bergse Bossen is easily reached by public transport and train taxi (treintaxi). See also the route description.

    Registration: Send an e-mail with your name, address, e-mail address and telephone number to Mark de Rooij (rooijm@fsw.leidenuniv.nl).

    Transfer the amount due to Giro 161723 of the Vereniging voor Ordinatie en Classificatie in Naarden, quoting ‘Jubileumcongres’. Belgian participants can use bank account number 777-5952385-56, Bacob-bank, in the name of VOC, Naarden.

    PROGRAMME

    Thursday, November 11
    13.50-14.00 Welcome
    14.00-14.40 Elffers Crooks as neighbours: two types of spatial influence models in criminology
    14.40-15.20 Heisterkamp Assessing health impact of sources of air pollution using Bayesian space-time models
    15.20-16.00 Wehrens Clustering image data
    16.00-16.30 Break
    16.30-17.30 Kiers Visualizing dependency of bootstrap confidence intervals for methods yielding spatial configurations
    17.30-18.30 Drinks
    18.30-20.30 Dinner
    20.45-21.45 Heiser From Archimedes to Benzécri: How the center of gravity and the moment of inertia entered into statistics
    Friday, November 12
    9.00-10.00 Buja Nonlinear dimension reduction
    10.00-10.30 Break
    10.30-11.10 Bijmolt Country and consumer segmentation: multi-level latent class analysis of financial product ownership
    11.10-11.50 Van Deun Spatial representation of preference and other ranking data
    11.50-12.30 Van de Velden Correspondence analysis of rating data
    12.30-13.30 Lunch
    13.30-14.10 Debba Segmentation techniques applied in deriving an optimal sampling scheme
    14.10-14.50 De Gruyter Geostatistical classification of agricultural fields on the basis of nitrate leaching
    14.50-15.30 Lesaffre Correcting for misclassification in caries research
    15.30-? Coffee

    Henk Elffers & Wim Bernasco (NSCR)
    Crooks as neighbours: two types of spatial influence models in criminology

    Abstract to follow

    Prof. dr. Henk Elffers (1948) studied mathematical statistics and probability theory at the University of Amsterdam and obtained his doctorate on a topic in fiscal psychology at Erasmus University Rotterdam. He is currently theme coordinator of the research programme 'Spreiding en Verplaatsing van Criminaliteit' (Distribution and Displacement of Crime) at the Netherlands Institute for the Study of Crime and Law Enforcement (NSCR, Nederlands Studiecentrum Criminaliteit en Rechtshandhaving) in Leiden and professor of legal psychology at the University of Antwerp. His research interests include the psychology of rule compliance, rational choice theory, spatial aspects of crime and law enforcement, and the role of statistics in criminal proceedings.

    Siem Heisterkamp (RIVM)
    Assessing health impact of sources of air pollution using Bayesian space-time models

    We use spatio-temporal models to relate hospital discharge for acute myocardial infarction and bronchitis in the years 1991-1993 to noise and distance from Schiphol airport. The goodness of fit of the different spatial models was assessed using expected predicted deviance. In this paper we will explain why these models are used in epidemiology, how they are used and what we can learn from them.

    Simon Heisterkamp is senior-statistician at the National Institute for Public Health and the Environment in Bilthoven (The Netherlands). His interests are in Bayesian statistics and its application in spatial statistics, prediction using time series of infectious disease data and analysis of micro-array data.

    Ron Wehrens (KU Nijmegen)
    Clustering image data

    Automatic segmentation of multivariate images can be achieved by clustering individual pixels. In this presentation, I will focus on model-based clustering, where the data are described by a mixture of multivariate normal distributions. This is a versatile and easily applicable method which gives suggestions on the optimal clustering model, and information on the uncertainty in specific regions of the image. In many applications, these are properties of tremendous importance. Several other characteristics of clustering multivariate images deserve attention. First, because of the sheer size of images, many clustering methods are not directly applicable. Model-based clustering, for instance, is quite slow, and in some implementations uses a hierarchical clustering for the initialisation. We will show how to deal with this. Second, it may be difficult to detect small clusters; these tend to be overwhelmed by the sheer amount of pixels in larger clusters. For this, an iterative strategy has been developed. And finally, in some cases significant improvements may be obtained by incorporating spatial information, i.e. information on the location of the pixel in the image or information on the classification of neighbouring pixels.

    Ron Wehrens is affiliated with the Department of Analytical Chemistry at Radboud University Nijmegen. His research is in the field of chemometrics, that is, the application of multivariate statistics and global optimisation to chemical systems. Examples of applications are the clustering of molecules, the prediction of chemical or biological activities, and the identification and quantification of substances, usually on the basis of various kinds of spectral information.

    Henk Kiers (RU Groningen) & Patrick Groenen (Erasmus Universiteit)
    Visualizing dependency of bootstrap confidence intervals for methods yielding spatial configurations

    Several techniques exist for summarizing data by means of a graphical configuration of points in a low-dimensional space. The most common examples are multidimensional scaling (MDS) and principal component analysis (PCA). Usually, such analyses are applied to data for a sample drawn from a population, while the researcher often hopes that the configuration (at least roughly) holds for the full population. For instance, a PCA may be carried out on the scores of a sample of subjects on a particular set of variables, while, when PCA is used to display the relations between the variables in a plot, it is hoped that this plot also holds (roughly) for the whole population. To assess how accurate the sample based plot is as a representation for the population, confidence intervals or ellipsoids can be constructed around each plotted point (representing a variable). For this purpose, it has been proposed to use a bootstrap procedure. This procedure gives a full configuration for each bootstrap sample, so we end up with a great many configurations that jointly display variation of the configuration of the variables upon resampling. Usually, the variation is displayed by considering different variables separately. That is, for each individual variable, its location in the low-dimensional space in all bootstrap samples is assessed, and the variation of these locations is represented, for instance, by confidence ellipsoids. However, such a procedure ignores the dependency of variation of different variables across bootstrap samples. To display how variation of different variables depends on each other, we propose to visualize bootstrap configurations in a temporally smooth way (movie). Problems encountered then are: How to smooth the transitions from configuration to configuration, and, related to this, how to order the configurations. These problems and some first solutions will be described and demonstrated in the presentation.

    Henk Kiers's research focuses on techniques for multivariate data analysis, such as two-way and three-way component analysis. He is a professor in this field at the Department of Psychology of the University of Groningen (RuG). As a former chair of the VOC and current president of the IFCS, he has always been closely involved in "Data Analysis and Classification".

    Willem J. Heiser (Leiden University)
    From Archimedes to Benzécri: How the center of gravity and the moment of inertia entered into statistics

    Spatial or geometrical models, such as a Euclidean distance model, a hierarchical tree model, or a factor analysis model, usually serve to account for derived measures of (dis-) similarity, association, or correlation between objects, categories, or variables, respectively. They approximate or highlight aspects of the data. However, there are also basic geometrical models that accommodate all possible data distributions of specific kinds. For any given data set, they express the data geometrically, rather than numerically. Examples of interesting data spaces are the simplex for relative frequencies over a set of categories, and the permutation polytope for rankings of a set of options, but of course, we may simply consider points on a line, too. In this context, the paper looks back into history to trace the origins of two major descriptors of data distributions, the expected value and the variance.

    Willem J. Heiser studied psychology in Leiden and completed his dissertation “Unfolding Analysis of Proximity Data” there in 1981. After a post-doc year at Bell Telephone Labs in Murray Hill, New Jersey, he was appointed professor of data theory at Leiden University in 1989. His research focuses on the analysis of multivariate categorical data using multidimensional scaling and classification techniques. He was invited as a visiting professor by the Universidad de Granada, the Universidad de Santiago de Compostela, the University of Exeter, and the Université de Haute Bretagne. He was elected president of the Psychometric Society (2003-2004), is a former editor of Psychometrika (1995-1999), and is the current editor of the Journal of Classification (2002-present).

    Andreas Buja & Lisha Chen (The Wharton School, University of Pennsylvania, Philadelphia)
    Nonlinear Dimension Reduction

    Nonlinear dimension reduction has been a topic of interest on and off for at least half a century. Among the better-known approaches are: the continuous-variable versions of the GIFI system, the PRINCALS program, and multiple correspondence analysis; Kruskal-Shepard multidimensional scaling (MDS) when used for dimension reduction; and principal curves and surfaces as introduced by Hastie and Stuetzle. Recently, computer scientists in machine learning have proposed novel nonlinear dimension reduction schemes. One proposal, called InfoMap, is just classical Torgerson-Young MDS applied to a novel distance matrix. The other proposal, called 'Locally Linear Embedding' or LLE, is conceptually more novel in that it attempts to recreate local affine relations among neighboring points.
    In this talk we will show that Kruskal-Shepard distance scaling is a strong competitor to InfoMap and LLE when applied to a localized distance matrix augmented by a repulsive force between non-local object pairs. We call the resulting method 'Local MDS' or 'LMDS'. Localizing MDS by using only distances between near neighbors has been attempted many times and has always been shown to be unstable to the point of uselessness. However, the repulsive force proposed here often does a good job of spreading out and stabilizing point configurations. On some of the illustrative datasets used in the InfoMap and LLE articles, LMDS shows superior performance in that it reveals more detail than the two other methods.
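    The idea of combining local fit with repulsion can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the objective function, so the form used here (squared error on near-neighbor pairs minus a weight times the distances of all other pairs), the neighborhood size `k`, the weight `repulsion`, and the plain gradient descent with step halving are all choices made for this sketch.

```python
import math
import random


def knn_pairs(D, k):
    """Pairs (i, j), i < j, where one point is among the other's k nearest."""
    local = set()
    for i, row in enumerate(D):
        order = sorted((j for j in range(len(row)) if j != i), key=row.__getitem__)
        local.update((min(i, j), max(i, j)) for j in order[:k])
    return local


def objective(X, D, local, repulsion):
    """Assumed form: local squared error minus repulsion * non-local distances."""
    total = 0.0
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            d = math.dist(X[i], X[j])
            total += (D[i][j] - d) ** 2 if (i, j) in local else -repulsion * d
    return total


def gradient(X, D, local, repulsion):
    G = [[0.0] * len(X[0]) for _ in X]
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            d = math.dist(X[i], X[j]) or 1e-12  # avoid division by zero
            coef = -2.0 * (D[i][j] - d) / d if (i, j) in local else -repulsion / d
            for a in range(len(X[0])):
                g = coef * (X[i][a] - X[j][a])
                G[i][a] += g
                G[j][a] -= g
    return G


def local_mds(D, k=3, repulsion=0.05, dim=2, steps=300, seed=0):
    rng = random.Random(seed)
    local = knn_pairs(D, k)
    X = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in D]
    start = value = objective(X, D, local, repulsion)
    step = 0.05
    for _ in range(steps):
        G = gradient(X, D, local, repulsion)
        trial = [[x - step * g for x, g in zip(xi, gi)] for xi, gi in zip(X, G)]
        trial_value = objective(trial, D, local, repulsion)
        if trial_value < value:          # accept only improving steps
            X, value, step = trial, trial_value, step * 1.1
        else:
            step *= 0.5
    return X, start, value


# Toy data: 20 points along a spiral in three dimensions.
points = [(math.cos(a), math.sin(a), a / 2) for a in (0.4 * i for i in range(20))]
D = [[math.dist(p, q) for q in points] for p in points]
X, start, end = local_mds(D)
```

    Because only improving steps are accepted, the final objective value `end` cannot exceed the starting value `start`; how well the configuration unrolls the spiral depends on the assumed values of `k` and `repulsion`.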

    Andreas Buja is a chaired professor in the Statistics Department at the Wharton School, University of Pennsylvania, Philadelphia. He is interested in machine learning, in particular boosting, as well as multidimensional scaling, multivariate analysis, and data visualization. Previous employment, in reverse order: AT&T Labs, AT&T Bell Labs, Bellcore, Salomon Brothers, University of Washington, Stanford University and Stanford Linear Accelerator (visiting faculty), Children's Hospital of Zurich (research associate).

    Tammo Bijmolt (RU Groningen), Leo Paas & Jeroen Vermunt (Tilburg University)
    Country and consumer segmentation: multi-level latent class analysis of financial product ownership

    The financial services sector has internationalized over the last few decades. Important differences and similarities in financial behavior can be anticipated both between consumers within a particular country and between consumers living in different countries. For companies in this market, the appropriate choice between strategic options and the resulting international performance may critically depend on the cross-national market structure of the various financial products. Insight into country segments and international consumer segments based on domain-specific behavioral variables will therefore be of key strategic importance. We present a multi-level latent class framework for obtaining such country and consumer segments simultaneously. In an empirical study we apply this methodology and several alternative modeling approaches to data on ownership of eight financial products. Information is available for fifteen European countries, with a sample size of about 1000 consumers per country. We find that both country segments and consumer segments are highly interpretable. Also, consumer segmentation is related to demographic variables such as age and income. Our conclusions feature implications, both academic and managerial, and directions for future research.

    Tammo H.A. Bijmolt is Professor of Marketing Research at the Department of Marketing, Tilburg University, The Netherlands; as of September 2004 he is at the University of Groningen. His research interests are a fruitful combination of methodology and conceptual marketing issues. Among his major research themes are meta-analysis, multidimensional scaling, and modelling of consumer choice behavior. He has published papers in several international journals, such as Journal of Marketing Research, Journal of Consumer Research, International Journal of Research in Marketing, Journal of Classification, and Multivariate Behavioral Research.

    Katrijn Van Deun (KU Leuven)
    Spatial representation of preference and other ranking data

    Ranking data can be represented in Euclidean space by a high-dimensional structure called the permutation polytope. This presentation discusses how low-dimensional representations can be derived from the permutation polytope that hold information on the relation between a judge and the objects he or she ranked, and also on the relations among the objects and among the judges. First, two representations based on projecting the polytope onto a low-dimensional subspace are presented: these are known as the principal components biplot and the vector model of unfolding. Next, another type of low-dimensional representation is introduced that is based on a two-step approach: first, distances are measured in the permutation polytope extended with the objects; second, these distances are subjected to an ordinal multidimensional scaling analysis. The different low-dimensional representations are compared using an empirical example.
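    As a small illustration of the starting point, the sketch below lists the vertices of the permutation polytope for four objects and computes distances between rankings in that space. It covers only the representation of rankings as points; the extension of the polytope with the objects and the ordinal scaling step are not specified in the abstract and are not attempted here. The function names are made up for this example.

```python
import itertools
import math


def polytope_vertices(n_objects):
    """Every ranking of n objects, written as a vector of ranks 1..n."""
    return list(itertools.permutations(range(1, n_objects + 1)))


def spearman_from_distance(r, s):
    """Spearman's rho recovered from the squared Euclidean distance."""
    n = len(r)
    squared = sum((a - b) ** 2 for a, b in zip(r, s))
    return 1 - 6 * squared / (n * (n * n - 1))


vertices = polytope_vertices(4)
centre = [sum(col) / len(vertices) for col in zip(*vertices)]
radii = {round(math.dist(v, centre), 9) for v in vertices}

judge_a, judge_b = (1, 2, 3, 4), (4, 3, 2, 1)
print(len(vertices), radii, spearman_from_distance(judge_a, judge_b))
```

    All vertices lie at the same distance from the centre, and the squared Euclidean distance between two judges is a linear function of their Spearman rank correlation, which is why distances in this space are meaningful for ranking data.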

    Katrijn Van Deun obtained her degree in Psychological Sciences at the Katholieke Universiteit Leuven. She currently works there in the Department of Psychology, where she is preparing a dissertation on the degeneracy problem in multidimensional unfolding. She is also involved in teaching multivariate methods and analysis of variance.

    Michel van de Velden (Erasmus University)
    Correspondence analysis of rating data

    Correspondence analysis is a popular method for data visualization. The typical formulation and derivation of correspondence analysis is based on the analysis of a two-way contingency table. However, mathematically there are no objections against the application of correspondence analysis to any type of nonnegative data matrix. In this presentation, we consider the analysis of rating data using correspondence analysis. Rating data can be used to identify preferences or perceptions of a group of subjects concerning a set of objects. For example, in marketing research, consumers can indicate preferences concerning products (or product attributes) by assigning ratings. In correspondence analysis the relationships between the products (as indicated in the ratings) are then depicted graphically, allowing a quick and easy interpretation. However, as noted by van de Velden (2000: In “Innovations in multivariate statistical analysis”, eds. R.D. Heijmans, D.S.G. Pollock and A. Satorra) and Torres and Greenacre (2002: International Journal of Research in Marketing), there exist different approaches to correspondence analysis of rating data. Some theoretical aspects of these differences were studied in van de Velden (2004: Journal of Classification). Here we will focus on the practical consequences of the differences.
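    To make the point about nonnegative matrices concrete, here is a minimal sketch that runs correspondence analysis directly on a small ratings matrix, using reciprocal averaging to find the first axis. The ratings are invented for the example, and the sketch shows just one possible treatment of rating data (analysing the raw ratings as they are); it does not reproduce the alternative approaches compared in the papers cited above.

```python
def total_inertia(table):
    """Chi-squared statistic of the table divided by its grand total."""
    row_sum = [sum(row) for row in table]
    col_sum = [sum(col) for col in zip(*table)]
    total = sum(row_sum)
    return sum(
        (v - r * c / total) ** 2 / (r * c / total)
        for row, r in zip(table, row_sum)
        for v, c in zip(row, col_sum)
    ) / total


def first_ca_axis(table, iterations=500):
    """First nontrivial axis of correspondence analysis by reciprocal averaging."""
    row_sum = [sum(row) for row in table]
    col_sum = [sum(col) for col in zip(*table)]
    total = sum(row_sum)
    col_score = [float(j) for j in range(len(col_sum))]   # arbitrary start
    for _ in range(iterations):
        # centre and rescale the column scores, weighting by column totals
        mean = sum(c * y for c, y in zip(col_sum, col_score)) / total
        col_score = [y - mean for y in col_score]
        scale = (sum(c * y * y for c, y in zip(col_sum, col_score)) / total) ** 0.5
        standard = [y / scale for y in col_score]
        # each row score is the weighted average of the column scores ...
        row_score = [sum(v * y for v, y in zip(row, standard)) / r
                     for row, r in zip(table, row_sum)]
        # ... and each column score the weighted average of the row scores
        col_score = [sum(row[j] * x for row, x in zip(table, row_score)) / c
                     for j, c in enumerate(col_sum)]
    # shrinkage of the standardized scores after one round trip
    inertia = (sum(c * y * y for c, y in zip(col_sum, col_score)) / total) ** 0.5
    return row_score, standard, inertia


# Hypothetical ratings (1-5) given by four consumers to three products.
ratings = [[5, 3, 1],
           [4, 4, 2],
           [1, 2, 5],
           [2, 3, 4]]
rows, cols, inertia = first_ca_axis(ratings)
```

    The row scores place the consumers and the column scores place the products on the first axis; `inertia` is the part of `total_inertia(ratings)` that this axis accounts for.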

    Michel van de Velden is a postdoc at the Econometrisch Instituut of the Erasmus Universiteit Rotterdam. In 2000 he received his doctorate from the Universiteit van Amsterdam with the dissertation “Topics in Correspondence Analysis”. After his doctorate he worked for some time at the Rijksuniversiteit Groningen. He then spent two years as a Marie Curie Fellow at the Universitat Pompeu Fabra in Barcelona. His research interests lie in multivariate statistics, in particular theoretical and practical aspects of correspondence analysis and related methods. His work has appeared in, among others, Linear Algebra and its Applications and Journal of Classification.

    Pravesh Debba, Alfred Stein, Freek van der Meer and Arko Lucieer (ITC International Institute for Geo-Information Science and Earth Observation, Enschede)
    Segmentation techniques applied in deriving an optimal sampling scheme

    An optimized sampling scheme is presented which is useful in selecting samples that represent different categories of interest without any presampling field data. This method uses the iterated conditional modes algorithm (ICM) as an unsupervised segmentation technique to create several homogeneous categories. Within each category, simulated annealing is applied as an optimization technique by minimizing the mean shortest distance between sampling points. The number of sampling points in each category was proportional to the size and variability of the category.
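    For readers unfamiliar with ICM, the following is a minimal sketch of the algorithm on a tiny grey-tone image. It is written under assumptions the abstract does not state: Gaussian noise with a common variance (absorbed into the smoothing weight `beta`), a four-neighbor Potts prior, and class means re-estimated after every sweep to keep the procedure unsupervised. The image is synthetic.

```python
import random


def neighbours(r, c, rows, cols):
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= r + dr < rows and 0 <= c + dc < cols:
            yield r + dr, c + dc


def energy(image, labels, means, beta):
    """Squared error to the class mean minus beta per agreeing neighbour pair."""
    rows, cols = len(image), len(image[0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            total += (image[r][c] - means[labels[r][c]]) ** 2
            total -= beta * (r + 1 < rows and labels[r + 1][c] == labels[r][c])
            total -= beta * (c + 1 < cols and labels[r][c + 1] == labels[r][c])
    return total


def icm(image, n_classes=2, beta=0.5, sweeps=5):
    rows, cols = len(image), len(image[0])
    lo, hi = min(map(min, image)), max(map(max, image))
    means = [lo + (k + 0.5) * (hi - lo) / n_classes for k in range(n_classes)]
    labels = [[min(range(n_classes), key=lambda k: (v - means[k]) ** 2) for v in row]
              for row in image]
    history = [energy(image, labels, means, beta)]
    for _ in range(sweeps):
        for r in range(rows):
            for c in range(cols):
                def cost(k):  # local energy of giving pixel (r, c) label k
                    agree = sum(labels[i][j] == k
                                for i, j in neighbours(r, c, rows, cols))
                    return (image[r][c] - means[k]) ** 2 - beta * agree
                labels[r][c] = min(range(n_classes), key=cost)
        for k in range(n_classes):  # re-estimate the class means
            members = [v for row, lab in zip(image, labels)
                       for v, l in zip(row, lab) if l == k]
            if members:
                means[k] = sum(members) / len(members)
        history.append(energy(image, labels, means, beta))
    return labels, means, history


# Synthetic image: dark left half, bright right half, plus noise.
rng = random.Random(0)
image = [[(0.0 if c < 5 else 1.0) + rng.gauss(0, 0.3) for c in range(10)]
         for r in range(10)]
labels, means, history = icm(image)
```

    Each pixel update and each update of the class means can only lower the energy, so `history` is non-increasing; the resulting label map plays the role of the homogeneous categories.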
    To test the methodology, a generated image with several categories was used. Most categories resulted in an almost equilateral triangular design of the sampling points, thereby enforcing an even spread of the samples within each category. The derived sample points in effect will have image characteristics, for example, gray tone, texture, reflectivity or pattern, depending on the type of segmentation performed.
    The combination of previously well-formulated techniques, such as ICM for image segmentation and simulated annealing for optimized sampling, results in an elegant and powerful tool for designing optimal sampling schemes using remote sensing images.
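    The simulated annealing half of the combination can be illustrated as follows. The sketch assumes that the criterion is the mean distance from every pixel of a category to its nearest sampling point, which is one common reading of "mean shortest distance" and the one that rewards an even spread; the abstract itself does not spell the criterion out. The cooling schedule, the move rule and the square test category are likewise choices made for this example.

```python
import math
import random


def mean_shortest_distance(cells, samples):
    """Average distance from each cell of the category to its nearest sample."""
    return sum(min(math.dist(c, s) for s in samples) for c in cells) / len(cells)


def anneal(cells, n_samples, steps=1000, temperature=1.0, cooling=0.995, seed=0):
    rng = random.Random(seed)
    current = rng.sample(cells, n_samples)
    value = start = mean_shortest_distance(cells, current)
    best, best_value = list(current), value
    for _ in range(steps):
        candidate = list(current)
        candidate[rng.randrange(n_samples)] = rng.choice(cells)   # move one point
        new_value = mean_shortest_distance(cells, candidate)
        # always accept improvements; accept deteriorations with falling probability
        if new_value <= value or rng.random() < math.exp((value - new_value) / temperature):
            current, value = candidate, new_value
            if value < best_value:
                best, best_value = list(current), value
        temperature *= cooling
    return best, start, best_value


cells = [(x, y) for x in range(10) for y in range(10)]   # one square category
best, start, best_value = anneal(cells, n_samples=5)
```

    The function keeps the best design seen so far, so `best_value` is never worse than the random starting design.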

    Pravesh Debba was born in Durban, KwaZulu-Natal, South Africa in 1969. He received a B.Sc. degree (Mathematics and Statistics) and a B.Sc. (Hons) (Statistics) from the University of Durban-Westville, Durban, KwaZulu-Natal, South Africa in 1991 and 1992, respectively. He received an M.Sc. degree in Biostatistics from Limburgs Universitair Centrum, Diepenbeek, Limburg, Belgium in 1998. He started his career as a junior lecturer in the department of statistics at the University of Durban-Westville in 1993 and continued at the University of South Africa, Pretoria, Gauteng, South Africa, from 1994 until 1999. He then joined the School of Mathematical and Statistical Sciences at the University of KwaZulu-Natal, Durban, KwaZulu-Natal, South Africa in 2000, where he is presently employed as a lecturer. He is currently pursuing a Ph.D. degree at the ITC International Institute for Geo-Information Science and Earth Observation, Enschede, The Netherlands. His research interests lie in designing optimal sampling schemes using remote sensing images.

    Jaap de Gruijter (Alterra, Wageningen University and Research Centre)
    Geostatistical classification of agricultural fields on the basis of nitrate leaching

    Leaching of nitrates from soils used for agriculture to the groundwater is one of the major public health issues in The Netherlands. The government policy is to limit nitrate emission to the groundwater by regulating the nitrogen application by individual farmers. As sandy soils with low groundwater tables are more susceptible to nitrate leaching than other soils, the government decided to set a special, more severe limitation for nitrogen application on dry sandy soils. As this extra limitation has substantial negative financial consequences for the farmers in question, the mapping of these susceptible soils is a soil survey task linked to a hot political dossier. As the extra nitrogen limitation applies to agricultural parcels (management units), the problem is to indicate which parcels are on dry sandy soil. Two questions have to be resolved.
    Firstly, how are “parcels on dry sandy soil” to be defined? Research on leaching led to the following definition. A susceptible soil has texture-class ‘sand’, Mean Highest Groundwater table (MHG) deeper than 60 cm below surface, and Mean Lowest Groundwater table (MLG) deeper than 120 cm. This definition applies to a point in the field, but properties vary within parcels. Therefore a susceptible parcel was defined as having at least 2/3 of its area on susceptible soil.
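    The two-level definition translates directly into a pair of rules. The sketch below assumes depths are recorded in centimetres below the surface and that a parcel is described by equally sized grid cells, so that the share of cells stands in for the share of area; the function names are invented for the illustration.

```python
def point_is_susceptible(texture, mhg_cm, mlg_cm):
    """Point rule: sand, MHG deeper than 60 cm and MLG deeper than 120 cm."""
    return texture == "sand" and mhg_cm > 60 and mlg_cm > 120


def parcel_is_susceptible(cells):
    """Parcel rule: at least two thirds of the cells lie on susceptible soil."""
    hits = sum(point_is_susceptible(*cell) for cell in cells)
    return 3 * hits >= 2 * len(cells)


# Hypothetical parcel of three cells: (texture, MHG in cm, MLG in cm).
parcel = [("sand", 75, 150), ("sand", 90, 180), ("sand", 50, 130)]
print(parcel_is_susceptible(parcel))   # two of the three cells qualify
```

    The comparison is done in whole numbers so that a parcel with exactly two thirds of its cells on susceptible soil is not lost to rounding.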
    Secondly, how can we classify parcels as ‘susceptible’ or ‘not susceptible’? The available data are: (a) observations at sample points on texture, MHG and MLG, (b) spatially exhaustive maps of ancillary variables such as soil type, altitude and drainage class. To this end we developed a geostatistical method, referred to as ‘conditional Gaussian co-simulation with uncertain trends’. The steps are as follows.
    (1) Develop a multiple linear regression model to predict MHG from the available ancillary variables, and similarly a model for MLG. These models represent the (bivariate) trend.
    (2) Fit variogram models to the calculated MHG and MLG regression residuals, representing the spatial autocorrelation structure of the residuals.
    (3) Generate a large number (300) of pairs of correlated MHG and MLG fields (values arranged on a fine grid) by Monte Carlo simulation, conditional on the data at the sample points, using the variograms and the trend models. The variation within and between the generated fields represents the uncertainty about regression residuals between sample points, as well as the uncertainty about the regression parameters.
    (4) Post-process the simulation results by determining, for each parcel separately, the frequency by which the pairs of fields meet the classification criteria for ‘susceptible’. Classify the parcel as ‘susceptible’ if this frequency is at least 95%, a confidence level chosen by the government in order to balance the farmer’s financial risk of false ‘susceptible’ classification against the ecological risks of false ‘not susceptible’ classification.
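    Step (4) can be sketched as follows. The simulated fields here come from a deliberately crude placeholder (a fixed depth per parcel plus independent uniform noise), because the conditional co-simulation of steps (1) to (3) is beyond a short example; only the counting and the 95% rule follow the description above. All parcel names and depths are hypothetical, and every cell is assumed to have texture class sand.

```python
import random


def parcel_meets_criteria(mhg, mlg, cells):
    """At least 2/3 of the cells have MHG > 60 cm and MLG > 120 cm."""
    hits = sum(mhg[c] > 60 and mlg[c] > 120 for c in cells)
    return 3 * hits >= 2 * len(cells)


def classify(parcels, simulations, level=0.95):
    """Frequency with which each parcel meets the criteria, and the verdict."""
    result = {}
    for name, cells in parcels.items():
        freq = sum(parcel_meets_criteria(mhg, mlg, cells)
                   for mhg, mlg in simulations) / len(simulations)
        result[name] = (freq, freq >= level)
    return result


# Placeholder for steps (1)-(3): NOT a geostatistical simulation.
rng = random.Random(1)
base = {"dry": (100, 200), "wet": (30, 90), "borderline": (62, 150)}  # (MHG, MLG)
parcels = {name: [(name, i) for i in range(6)] for name in base}
simulations = []
for _ in range(300):
    mhg = {c: base[c[0]][0] + rng.uniform(-10, 10)
           for cells in parcels.values() for c in cells}
    mlg = {c: base[c[0]][1] + rng.uniform(-10, 10)
           for cells in parcels.values() for c in cells}
    simulations.append((mhg, mlg))

result = classify(parcels, simulations)
```

    With these placeholder fields the parcel named "dry" meets the criteria in every simulation and the parcel named "wet" in none; the borderline parcel shows how the 95% level turns simulation frequencies into a yes-or-no decision.
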
    As this method is computationally very demanding, special attention was given to the number of simulations needed in step 3. The method was successfully tested in a number of test areas and is now being applied on a routine basis.

    Jaap de Gruijter works at the Centrum Bodem (Soil Centre) of Alterra, a research institute for the rural environment and part of Wageningen Universiteit & Research Centre (WUR). He works on statistical methods for the spatial inventory and monitoring of natural resources such as soil, groundwater, vegetation, forest and landscape.

    Emmanuel Lesaffre (Biostatistical Centre, KU Leuven)
    Correcting for misclassification in caries research

    In large epidemiological (dental) studies, several observers are often involved. It is customary to highlight the between-observer agreement with a kappa statistic. However, the kappa statistic cannot distinguish between variability and bias in scoring. When a benchmark scorer is available, it is preferable to report the sensitivity and specificity of the examiners with respect to the benchmark scorer. Better still is to correct for misclassification. We will discuss the issue of correcting for misclassification in models for count data. Frequentist and Bayesian approaches can be used. The approaches are applied to the first year’s data of a dental longitudinal study (Signal Tandmobiel
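    The difference between the three summaries is easy to see on a two-by-two table of one examiner against the benchmark scorer. The counts below are invented. The last function shows the simplest textbook correction, for a binary prevalence rather than for the count-data models discussed in the talk.

```python
def agreement_summary(tp, fn, fp, tn):
    """tp, fn: benchmark-positive cases scored +/-; fp, tn: benchmark-negative."""
    n = tp + fn + fp + tn
    observed = (tp + tn) / n
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    return {
        "kappa": (observed - expected) / (1 - expected),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
    }


def corrected_prevalence(observed, sensitivity, specificity):
    """Invert observed = Se * p + (1 - Sp) * (1 - p) for the true prevalence p."""
    return (observed + specificity - 1) / (sensitivity + specificity - 1)


summary = agreement_summary(tp=40, fn=10, fp=5, tn=45)
print(summary)
print(corrected_prevalence(0.45, summary["sensitivity"], summary["specificity"]))
```

    Kappa condenses the table into one number, whereas sensitivity and specificity show in which direction the examiner errs, and it is these two quantities that a correction needs.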

    Emmanuel Lesaffre works at the Biostatistical Centre of the Katholieke Universiteit Leuven. His research focuses on clinical trials, repeated measurements, survival analysis, statistics in dentistry, and meta-analyses.


    Information is available from Paul Eilers (p.eilers@lumc.nl)

    The Jubilee Committee
    Paul Eilers, Berrie Zielman and Mark de Rooij

    VOC Spring Meeting 2004

    9-11 March, University of Dortmund

    This year the spring meeting is organized in Dortmund, in cooperation with the German sister society, the GfKl (Gesellschaft für Klassifikation). Further information about the programme can be found at http://www.statistik.uni-dortmund.de/GfKl2004/index.html.

    Programme overview - VOC sessions

    • 9 March 9.30 W.J. Heiser
      Fundamental role of row-to-column distances and shadow points in correspondence analysis
    • 9 March 11.15 Mixture Modelling session (Jeroen Vermunt)
      Speakers: Magidson, Dias and Vermunt.
    • 9 March 16.15 Optimal Scaling session (Anita van der Kooij & Elise Dusseldorp)
      Speakers: Van der Kooij, Manisera, Dusseldorp, Van der Leeden.
    • 10 March 9.45 Henk Kiers
      Bootstrap confidence intervals for three-way component methods.
    • 10 March 13.30 Multiway Methods session (Henk Kiers)
      Speakers: Krolak-Schwerdt, Van Mechelen, Smilde and Ceulemans.
    • 10 March 15.35 Psychometrics session (Paul De Boeck)
      Speakers: Berger, Steyer, Kamphuis and Rijmen.

    Abstracts of the VOC sessions during the joint meeting of the GfKl and the VOC

    Willem J. Heiser (Leiden University)
    The fundamental role of row-to-column distances and shadow points in correspondence analysis
    Correspondence analysis can be described as a method to approximate chi-squared distances among either the row or the column profiles of a contingency table. A powerful way to study specific associations between row and column elements is by plotting the correspondence analysis results in a simultaneous display, or joint plot. However, whether or not row-to-column distances in a joint plot can be interpreted with confidence has been a matter of debate. The conventional view is that we can scale the coordinates in such a way that either the row-to-row distances can be interpreted, or the column-to-column distances, but never directly the row-to-column distances (Heiser and Meulman, 1981; Greenacre and Hastie, 1987). Carroll, Green and Schaffer (1986) proposed an alternative scaling of the coordinates of a joint plot for which they claimed that a full distance interpretation is possible, but Greenacre (1989) has shown that this claim is not warranted. Apart from any dimension reduction, the representation of the data in correspondence analysis is a barycentric configuration of profile points with respect to unit profiles, which are hypothetical profiles for which all mass is concentrated in one cell. The paper shows that row-to-column distance interpretations are possible in any joint plot that preserves barycentricity. The distance involved is not of the chi-squared type, but simply Euclidean. In addition, a new type of supplementary point is introduced, called the shadow point, which allows the development of a simple formula for the reconstruction of the data in terms of row-to-column distances. These results are equally valid in the full-dimensional space as in a reduced space obtained by projection, or by any other method producing a suitable configuration of the unit profiles.
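    In the full-dimensional profile space the barycentric picture and the Euclidean row-to-column distances can be checked by hand. The sketch below does so for an invented 3 x 3 table. It illustrates only the elementary geometry (profiles as weighted averages of unit profiles, and the identity linking a profile entry to its distance from the corresponding vertex); it is not the shadow-point reconstruction formula of the paper, which the abstract does not state.

```python
import math

table = [[20, 5, 5], [4, 16, 10], [6, 9, 15]]   # hypothetical contingency table
n_cols = len(table[0])
profiles = [[v / sum(row) for v in row] for row in table]
unit_profiles = [[1.0 if k == j else 0.0 for k in range(n_cols)]
                 for j in range(n_cols)]


def barycentre(weights, points):
    """Weighted average of the points, one coordinate at a time."""
    return [sum(w * p[k] for w, p in zip(weights, points))
            for k in range(len(points[0]))]


def row_to_column_distances(profile):
    """Euclidean distance from a row profile to each unit profile (vertex)."""
    return [math.dist(profile, vertex) for vertex in unit_profiles]


def entry_from_distance(profile, j):
    """Recover profile[j] from d_j, using d_j^2 = sum_k a_k^2 - 2 a_j + 1."""
    d = row_to_column_distances(profile)[j]
    return (sum(a * a for a in profile) + 1 - d * d) / 2


for profile in profiles:
    print(profile, row_to_column_distances(profile))
```

    Within a row, the nearest vertex is always the column with the largest profile entry, which is the sense in which row-to-column distances can be read directly in a barycentric display.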

    José G. Dias (Departamento de Métodos Quantitativos, Instituto Superior de Ciências do Trabalho e da Empresa, ISCTE-UNIDE)
    On Measuring the Uncertainty of Classification from Model-Based Clustering Procedures
    The paper discusses measures of uncertainty of classification from model-based clustering procedures. In classification problems (supervised learning) the labels of the observations are known, and the goal is to learn the classification rule. In clustering problems (unsupervised learning), the labels of the objects (and the number of labels) are unknown. From model-based clustering procedures such as finite mixture models, one can obtain the posterior probability that each observation is generated by a given component or cluster. A further step is to apply an optimal Bayes rule that transforms this type of soft partition of observations into a hard partition, usually used as input for further analyses (cluster profiling, etc.). However, little is known about the uncertainty of the mapping from [0,1] to {0,1} operated by the classification rule. The paper explores measures of the uncertainty of classification using resampling techniques. Further illustrations of the framework using finite mixtures of conditionally independent Bernoulli/multinomial distributions (latent class model) are provided using synthetic and empirical data. Related topics such as entropic measures, label-switching strategies, and level of separation of components are discussed in the paper as well.
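    The step from soft to hard partition, and one simple way of quantifying what is lost, can be sketched for a two-class latent class model with three binary items. All parameter values are invented, and the expected error rate computed at the end is a plug-in quantity; the resampling-based measures studied in the paper are not reproduced here.

```python
import itertools

class_size = [0.6, 0.4]                          # hypothetical mixing proportions
item_prob = [[0.9, 0.8, 0.7], [0.2, 0.3, 0.1]]   # P(item = 1 | class)


def joint(pattern, k):
    """P(class k and response pattern) under conditional independence."""
    p = class_size[k]
    for y, theta in zip(pattern, item_prob[k]):
        p *= theta if y else 1 - theta
    return p


def posterior(pattern):
    """Soft partition: posterior class membership probabilities."""
    joints = [joint(pattern, k) for k in range(len(class_size))]
    return [j / sum(joints) for j in joints]


def modal_class(pattern):
    """Hard partition: assign to the class with the highest posterior."""
    post = posterior(pattern)
    return post.index(max(post))


# Expected share of misclassified cases under modal assignment.
expected_error = sum(
    sum(joint(y, k) for k in range(2)) * (1 - max(posterior(y)))
    for y in itertools.product((0, 1), repeat=3)
)
```

    A case with pattern (1, 1, 1) is assigned to the first class with very little doubt, while patterns with mixed responses contribute most to `expected_error`.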

    Jay Magidson (Statistical Innovations Inc., USA)
    Using CHAID to Profile Latent Segments when the Segmentation is Based on Multiple Criteria
    The CHAID algorithm has proven to be an effective approach for obtaining a quick but meaningful segmentation based on a single categorical criterion variable such as response to a mailing (i.e., RESPONSE = {Responder, Non-responder}). Latent class (LC) models have been increasingly used to develop segments based on multiple response indicators such as the selections made from each of M sets of alternatives in discrete choice studies. In this paper, we propose an efficient hybrid methodology that combines these two techniques to obtain a new extended CHAID algorithm to segment based on multiple criteria. The first step in the hybrid method is to use the LC model to decompose the set of multiple criteria into a single set of K underlying latent segments, and to estimate posterior membership probabilities for each of these segments for each case. A modified CHAID algorithm is then used to profile these segments as a function of demographic or other exogenous variables. The posterior membership probabilities are used as fixed weights in this profiling analysis to eliminate bias due to the misclassification error that occurs if cases were equated (with probability one) to that segment having the highest posterior probability. The new algorithm also incorporates sampling weights, if present, using an efficient ML algorithm proposed by Vermunt and Magidson (2001). The new hybrid method is illustrated on discrete choice data previously analyzed using an LC choice model. Several demographic variables were available for post hoc profiling of the segments. We show that the extended CHAID approach provides useful output that supplements the traditional profiling output obtained when the demographics are included as active or inactive covariates in the LC model.
    The CHAID-type output simplifies the process of examining the relationship between the segments and the covariates by 1) ranking the covariates from most to least significant and 2) for each covariate, combining categories that are not significantly different. This new output is shown to be especially valuable when the number of covariates is large. The new algorithm has been implemented in a commercially available computer program called CHAID™, and works in conjunction with the latent class programs Latent GOLD 4.0 and Latent GOLD Choice 4.0.
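    The effect of using posterior probabilities as weights, instead of assigning each case to its most likely segment, shows up already in a simple cross-tabulation. The sketch below builds both tables for four invented cases and one covariate; it does not implement CHAID's category merging or significance tests, nor the program mentioned above.

```python
from collections import defaultdict

# (posterior membership probabilities for segments 1 and 2, covariate category)
cases = [
    ([0.90, 0.10], "young"),
    ([0.60, 0.40], "young"),
    ([0.55, 0.45], "old"),
    ([0.20, 0.80], "old"),
]


def weighted_table(cases):
    """Each case contributes its posterior probabilities to every segment."""
    table = defaultdict(lambda: [0.0, 0.0])
    for post, category in cases:
        for k, p in enumerate(post):
            table[category][k] += p
    return dict(table)


def modal_table(cases):
    """Each case counts fully for its most likely segment."""
    table = defaultdict(lambda: [0, 0])
    for post, category in cases:
        table[category][post.index(max(post))] += 1
    return dict(table)


print(weighted_table(cases))
print(modal_table(cases))
```

    Modal assignment places both young cases in the first segment, whereas the weighted table keeps the half case of doubt in the second segment; it is this difference that the fixed weights are meant to preserve.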

    Jeroen K. Vermunt (Tilburg University, Tilburg, The Netherlands)
    Hierarchical Mixture Models for Nested Data Structures
    In social science research, but also in research in other fields, we are often confronted with nested or hierarchical data structures. Examples are data from employees belonging to the same organizations, individuals living in the same regions, customers of the same stores, repeated measures taken from the same individuals, and individuals belonging to the same primary sampling units in two-stage cluster samples. In this paper, I present an extension of the standard mixture model that takes into account the hierarchical structure of the data. Introducing random effects in the model of interest is one of the ways to deal with such dependent observations. It is well known that the finite mixture model itself is, in fact, a nonparametric random-effects model. The proposed solution involves introducing nonparametric random effects in a finite mixture model. This yields a model with a separate finite mixture distribution at each level of nesting. When using the hierarchical mixture model for clustering, one will not only obtain a clustering of lower-level units, but also a clustering of higher-level units. The clusters of higher-level units differ with respect to the sizes of the lower-level clusters. This is similar to what is done in multiple-group latent class analysis, with the difference that groups are assumed to belong to a small number of clusters (latent classes) instead of estimating a separate latent class distribution for each group. The hierarchical mixture model can also be used in the context of mixture (or latent class) regression analysis. As in clustering applications, one may assume that the latent class distribution differs across higher-level clusters. Another option is to assume that regression coefficients differ not only across lower-level clusters, but also across higher-level clusters. The latter specification yields a nonparametric three-level regression model.
    Because it is not practical to estimate the hierarchical mixture using a standard EM algorithm, I propose a variant of EM called the upward-downward algorithm. This method makes use of the tree structure of the underlying graphical model for an efficient implementation of the E step.
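    The structure of the model, with lower-level class sizes that depend on the higher-level class, can be made concrete by computing the likelihood of one group and the posterior distribution of its higher-level class. This is a plain bottom-up marginalization for a two-level model with binary items, written for illustration with invented parameter values; it is not the upward-downward algorithm itself, whose details the abstract does not give.

```python
import math

group_class_size = [0.5, 0.5]                   # P(higher-level class g)
class_size_given_g = [[0.8, 0.2], [0.2, 0.8]]   # P(lower-level class c | g)
item_prob = [[0.9, 0.9], [0.1, 0.1]]            # P(item = 1 | c)


def individual_likelihood(pattern, g):
    """P(pattern | g): sum over the lower-level classes."""
    total = 0.0
    for c, size in enumerate(class_size_given_g[g]):
        p = size
        for y, theta in zip(pattern, item_prob[c]):
            p *= theta if y else 1 - theta
        total += p
    return total


def group_posterior(patterns):
    """P(g | all response patterns observed in one group)."""
    log_joint = [
        math.log(prior) + sum(math.log(individual_likelihood(y, g)) for y in patterns)
        for g, prior in enumerate(group_class_size)
    ]
    top = max(log_joint)                         # guard against underflow
    weights = [math.exp(v - top) for v in log_joint]
    return [w / sum(weights) for w in weights]


group = [(1, 1)] * 5      # five members who all endorse both items
print(group_posterior(group))
```

    Members are independent given the higher-level class, so the group likelihood is a product of individual mixtures; a group whose members all resemble the first lower-level class is attributed almost entirely to the higher-level class in which that class is large.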

    Elise Dusseldorp and Anita J. van der Kooij (Data Theory Group, Department of Education, Leiden University)
    Combining regression analysis with optimal scaling and regression trees to estimate interaction effects
    Regression analysis with optimal scaling, also referred to as CATREG (Gifi, 1990; Meulman, Heiser, and SPSS, 1999), is especially suitable to model nonlinear relationships between a set of predictor variables and one outcome variable. In this paper we focus on regression problems with a numerical outcome variable and several nominal predictor variables. To model interaction effects between the predictor variables, we combine CATREG with another analysis technique, that is, regression trees (CART; Breiman et al., 1984). In the first step of the analysis, a main effects model is estimated using CATREG. In the second step, a regression tree is fitted on the residuals of the main effects model. A cross-validation procedure on the whole process is used to trace a stable small regression tree, referred to as a regression trunk (Dusseldorp and Meulman, 2001). In the final step, the regression trunk is added to the main effects model as a nominal predictor variable. The size and significance of the interaction effect are estimated. The results of the above approach are compared to the results of analysis of variance. Advantages of our approach are that (higher order) interaction effects are detected easily, and the interpretation of the effects is straightforward.
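    The logic of the three steps can be imitated with very simple stand-ins. In the sketch below an additive model fitted by backfitting takes the place of CATREG (a plausible substitute only because all predictors are treated as nominal), and an exhaustive search for the two-predictor cell with the most deviant mean residual takes the place of the cross-validated regression trunk. Both substitutions, the function names and the noise-free toy data are assumptions made for the illustration, not the authors' procedure.

```python
import itertools


def backfit(X, y, sweeps=100):
    """Additive main-effects fit: one effect per category of each predictor."""
    n, m = len(y), len(X[0])
    effects = [dict() for _ in range(m)]
    fitted = [0.0] * n
    for _ in range(sweeps):
        for j in range(m):
            old = [effects[j].get(row[j], 0.0) for row in X]
            partial = [yi - fi + oi for yi, fi, oi in zip(y, fitted, old)]
            groups = {}
            for row, p in zip(X, partial):
                groups.setdefault(row[j], []).append(p)
            effects[j] = {cat: sum(v) / len(v) for cat, v in groups.items()}
            fitted = [fi - oi + effects[j][row[j]]
                      for fi, oi, row in zip(fitted, old, X)]
    return fitted


def rss(y, fitted):
    return sum((a - b) ** 2 for a, b in zip(y, fitted))


def most_deviant_cell(X, residuals):
    """Cell of two predictors whose separation reduces residual variation most."""
    best = None
    for a, b in itertools.combinations(range(len(X[0])), 2):
        for cell in {(row[a], row[b]) for row in X}:
            inside = [r for row, r in zip(X, residuals) if (row[a], row[b]) == cell]
            outside = [r for row, r in zip(X, residuals) if (row[a], row[b]) != cell]
            gain = sum(inside) ** 2 / len(inside) + sum(outside) ** 2 / len(outside)
            if best is None or gain > best[0]:
                best = (gain, a, b, cell)
    return best[1:]


# Toy data: replicated 3 x 3 x 2 factorial with one interaction cell, no noise.
X = [(a, b, c) for a in range(3) for b in range(3) for c in range(2)] * 3
y = [1.0 * a + 0.5 * b - 0.3 * c + 2.0 * (a == 2 and b == 0) for a, b, c in X]

main = backfit(X, y)                                         # step 1
a, b, cell = most_deviant_cell(X, [yi - fi for yi, fi in zip(y, main)])  # step 2
X_trunk = [row + (int((row[a], row[b]) == cell),) for row in X]
with_trunk = backfit(X_trunk, y)                             # final step
```

    On these data the search singles out the cell in which the interaction was planted, and adding its indicator as an extra nominal predictor removes the residual variation left by the main effects model.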
    Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J. (1984): Classification and Regression Trees. Wadsworth, Belmont, CA.
    Dusseldorp, E. and Meulman, J.J. (2001): Prediction in medicine by integrating regression trees into regression analysis with optimal scaling. Methods of Information in Medicine, 40, 403-409.
    Gifi, A. (1990): Nonlinear multivariate analysis. Wiley, Chichester.
    Meulman, J.J., Heiser, W.J., and SPSS (1999): SPSS Categories 10.0. SPSS Inc., Chicago.

    Marica Manisera, Elise Dusseldorp, and Anita J. van der Kooij (Department of Quantitative Methods, University of Brescia and University of Milano-Bicocca, Italy, and Data Theory Group, Department of Education, Leiden University, The Netherlands)
    Scale Construction for Job Satisfaction by Categorical Principal Component Analysis
    The aim of this study is to construct one or more scales for job satisfaction, by means of Categorical Principal Component Analysis (CATPCA; Gifi, 1990), implemented in the Categories module of SPSS (Meulman, Heiser, and SPSS, 1999). CATPCA simultaneously turns categorical variables into quantitative variables using optimal scaling and reduces the dimensionality of the data. The optimal scaling process is a very general approach to treat multivariate categorical data. We use a dataset resulting from a survey that involves 2066 workers from 220 organizations from the Italian social service sector (for details about the survey, see Borzaga, 2000). This survey includes 13 items referring to different aspects of job satisfaction. These items are ordinal variables that are analyzed by CATPCA using monotonic (spline) transformations. From the CATPCA solution, we extract multiple scales reflecting different aspects of job satisfaction, and establish the reliability of these scales. A confirmatory analysis of the final solution will be conducted with EQS (Structural Equation Modeling) (Bentler and Weeks, 1980; Bentler and Wu, 2002) in order to investigate whether the different organizations reflect the same structure of job satisfaction as the one obtained by CATPCA.
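    What optimal scaling of ordinal items involves can be sketched with a one-dimensional alternating least squares loop. The sketch is an illustration under simplifying assumptions, not the CATPCA program: it extracts a single component, uses monotone step functions (found by pool-adjacent-violators) rather than monotonic splines, and runs on a small artificial data set instead of the survey data. All function names are invented.

```python
def standardize(values):
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]


def corr(x, q):                      # both arguments already standardized
    return sum(a * b for a, b in zip(x, q)) / len(x)


def monotone_regression(values, weights):
    """Weighted pool-adjacent-violators: best nondecreasing sequence."""
    blocks = []                      # [mean, weight, length]
    for v, w in zip(values, weights):
        blocks.append([v, w, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, c1 + c2])
    return [m for m, _, c in blocks for _ in range(c)]


def quantify(item, target):
    """Category means of the target, made monotone in the category order."""
    cats = sorted(set(item))
    counts = [item.count(c) for c in cats]
    means = [sum(t for v, t in zip(item, target) if v == c) / k
             for c, k in zip(cats, counts)]
    values = dict(zip(cats, monotone_regression(means, counts)))
    return [values[v] for v in item]


def optimal_scaling(items, iterations=30):
    Q = [standardize(item) for item in items]            # start: linear scaling
    x = standardize([sum(col) for col in zip(*Q)])       # object scores
    fits = []
    for _ in range(iterations):
        loadings = [corr(x, q) for q in Q]
        fits.append(sum(a * a for a in loadings))        # sum of squared loadings
        for j, item in enumerate(items):
            sign = 1 if loadings[j] >= 0 else -1
            new = quantify(item, [sign * v for v in x])
            if len(set(new)) > 1:                        # skip a degenerate result
                Q[j] = standardize(new)
        loadings = [corr(x, q) for q in Q]
        x = standardize([sum(a * q[i] for a, q in zip(loadings, Q))
                         for i in range(len(x))])
    return Q, x, fits


# Artificial data: 60 respondents, 4 ordinal items (0-3) driven by one trait.
trait = [i / 59 for i in range(60)]
items = [[min(3, int(4 * t ** (j + 1) + ((7 * i + 3 * j) % 5) / 5))
          for i, t in enumerate(trait)] for j in range(4)]
Q, scores, fits = optimal_scaling(items)
```

    Each update solves one part of the least squares problem exactly, so the sum of squared loadings in `fits` never decreases; the rows of `Q` hold the quantified items and `scores` the resulting scale values.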
    Bentler, P.M. and Weeks, D.G. (1980): Linear structural equation with latent variables. Psychometrika, 45, 289-308.
    Bentler, P.M. and Wu, E.J.C. (2002): EQS 6 for Windows user's guide. Multivariate Software, Encino, CA.
    Borzaga, C. (2000): Capitale umano e qualità del lavoro nei servizi sociali. [Human capital and job quality in social services]. FIVOL, Roma, Italy.
    Gifi, A. (1990): Nonlinear multivariate analysis. Wiley, Chichester.
    Meulman, J.J., Heiser, W.J., and SPSS (1999): SPSS Categories 10.0. SPSS Inc., Chicago.

    Anita J. van der Kooij, Mariëlle Linting, and Jacqueline J. Meulman (Data Theory Group, Department of Education, Leiden University, The Netherlands)
    Optimal Scaling and bootstrapping
    Optimal scaling techniques optimally transform variables with mixed measurement levels (Gifi, 1990). The transformations are optimal with respect to the particular model that is fitted, and for the particular data set the model is fitted to. By allowing for nonlinear (monotonic or nonmonotonic) transformations, optimal scaling finds optimal quantifications of categorical variables. The form of the transformation depends upon the scaling level, which can be chosen for each variable separately. The scaling level for a variable can be chosen according to the measurement level of the variable, or according to the kind of information in the variable the researcher chooses to retain in the quantified variable. Two optimal scaling programs available in the Categories module of SPSS (Meulman, Heiser, and SPSS, 1999) are CATPCA (categorical principal components analysis) and CATREG (categorical multiple regression). Both programs provide monotonic (spline) and nonmonotonic (spline) transformations. Although CATPCA and CATREG have been called exploratory techniques, they need not be deprived of confirmatory diagnostics. In this paper we will show the use of the nonparametric bootstrap (Efron, 1982; Efron and Tibshirani, 1993) with CATPCA to obtain stability measures of the transformations, and present a method for representing confidence regions for the transformations. We will also study the optimality of different transformation types obtained with CATREG in terms of prediction of future responses. To estimate the prediction error we use the .632 bootstrap estimator (Efron and Tibshirani, 1993).
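    The .632 estimator combines the optimistic apparent error with the pessimistic error on cases left out of each bootstrap sample. The sketch below computes it for a deliberately simple stand-in model (predicting the outcome by the training mean of a case's category) rather than for CATREG, using the leave-one-out form of the out-of-sample error; the data and all names are invented for the illustration.

```python
import random


def fit_category_means(data):
    """Stand-in model: mean outcome per predictor category, plus a fallback."""
    groups = {}
    for category, y in data:
        groups.setdefault(category, []).append(y)
    overall = sum(y for _, y in data) / len(data)
    means = {c: sum(v) / len(v) for c, v in groups.items()}
    return lambda category: means.get(category, overall)


def bootstrap_632(data, n_boot=200, seed=0):
    rng = random.Random(seed)
    n = len(data)
    model = fit_category_means(data)
    apparent = sum((y - model(c)) ** 2 for c, y in data) / n
    left_out = [[] for _ in range(n)]        # errors of cases not in the sample
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        model_b = fit_category_means([data[i] for i in idx])
        for i in set(range(n)) - set(idx):
            c, y = data[i]
            left_out[i].append((y - model_b(c)) ** 2)
    per_case = [sum(e) / len(e) for e in left_out if e]
    loo = sum(per_case) / len(per_case)
    return apparent, loo, 0.368 * apparent + 0.632 * loo


# Invented data: three categories, ten cases each, outcome = 2 * category + noise.
rng = random.Random(1)
data = [(c, 2.0 * c + rng.gauss(0, 1)) for c in (0, 1, 2) for _ in range(10)]
apparent, loo, err632 = bootstrap_632(data)
```

    The estimate always lies between the apparent error and the left-out error, with the weights .368 and .632 reflecting the expected share of distinct cases in a bootstrap sample.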
    Efron, B. (1982): The jackknife, the bootstrap, and other resampling plans. Society for Industrial and Applied Mathematics, Philadelphia.
    Efron, B. and Tibshirani, R.J. (1993): An introduction to the bootstrap. Chapman & Hall, New York.
    Gifi, A. (1990): Nonlinear multivariate analysis. Wiley, Chichester.
    Meulman, J.J., Heiser, W.J., and SPSS (1999): SPSS Categories 10.0. SPSS Inc., Chicago.

    Rien van der Leeden, Marike Polak, Renske Doorenspleet (Leiden University, The Netherlands)
    Empirical Scaling of Democracy
    Problems in scale construction arise in a variety of social research areas. For a long time, the measurement of democracy, and the development of a democracy scale positioning independent states relative to each other, has been a major concern in political science. Although there are a number of democracy indices known from the literature, the most widely used measure has been developed within the Polity Project (Marshall and Jaggers, 2002 [Polity IV]). Despite the fact that Gurr's scale is used in numerous studies on comparative politics and international relations, until now researchers have paid hardly any attention to the quality of the scale values it provides. Methodological issues related to Gurr's scale were only recently addressed by Munck and Verkuilen (2002), who state that this scale has, at the very least, been used too uncritically. Gurr's scale is based on five indicator variables. Thorough inspection of the scale and the procedure leading to scale values shows several problems and ambiguities. These include the use of a non-empirical coding and weighting scheme, the use of subjective and partly arbitrary category codes, and an assumed unidimensionality of the set of indicator variables. So far, these issues have not been explained, discussed, or defended by the researchers of the Polity Project. In this paper we present an alternative scale of democracy based on an empirical analysis of the Polity IV indicator variables. The SPSS program MCORRESPONDENCE, that is, multiple correspondence analysis (which also goes under the name of HOMALS), was applied for scale construction. Gurr's scale values are compared with the MCORRESPONDENCE results in terms of their quantitative relationship, and the surplus value of the MCORRESPONDENCE scale values in terms of interpretation is examined.
    Marshall, M.G. and Jaggers, K. (2002). Political regime characteristics and transitions, 1800-2000. Polity Project Website. Center for International Development and Conflict Management, University of Maryland, College Park.
    Munck, G.L., and Verkuilen, J. (2002). Conceptualizing and measuring democracy: evaluating alternative indices. Comparative Political Studies, 35(1), 5-34.

    Henk Kiers (Heymans Institute, University of Groningen)
    Bootstrap Confidence Intervals for Three-way Component Methods
    The two most common methods for the analysis of three-way data, CANDECOMP/PARAFAC and Tucker3 analysis, are used to summarize a three-mode three-way data set by means of a number of component matrices and, in the case of Tucker3, a core array. Almost always the analyses are applied to data pertaining to a sample from a larger population, and usually the results for the sample are assumed to be, at least to some extent, generalizable to the population from which the sample was drawn. In the practice of three-way analysis, the generalizability issue is usually dealt with by means of cross-validation or split-half comparisons. However, neither procedure gives concrete estimates of the uncertainties (due to sampling fluctuations) of our solutions. Here, it will be discussed how such uncertainty estimates, in the form of confidence intervals, can be obtained. For this purpose the bootstrap will be used (see Efron & Tibshirani, 1993). Choosing to define confidence intervals by means of the bootstrap is only the first step in the process of obtaining such uncertainty estimates. At least the following issues will be dealt with:
    • How to deal with the transformational nonuniqueness of the three-way methods? Several possibilities emerge. These will be described, and relative advantages and disadvantages will be discussed.
    • How well does the bootstrap procedure perform? Does the coverage in practice agree with the nominal coverage (e.g., 95%)? An answer to these questions was obtained through an extensive simulation study, which will be described here.
    • How can computations remain feasible?
    A simple way of obtaining a fast procedure is to use the original solution as the starting configuration for each bootstrap analysis. However, this may affect the coverage of the bootstrap intervals. In a simulation study this fast procedure is compared to theoretically better (but slower) procedures, in terms of both speed and coverage.
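
    The general idea of percentile bootstrap intervals for component loadings can be sketched as follows. This is only a simplified two-way analogue (first principal component of a data matrix), not the three-way procedure of the talk. The function names, the simulated data and the choice to resolve nonuniqueness by sign-matching to the original solution are assumptions made for illustration.

```python
import numpy as np

def first_component(X):
    """Loadings (unit length) of the first principal component of X."""
    Xc = X - X.mean(axis=0)                      # column-center the data
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[0]

def bootstrap_loading_intervals(X, n_boot=500, level=0.95, seed=0):
    """Percentile bootstrap intervals for the first-component loadings."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    reference = first_component(X)               # solution for the full sample
    boot = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        sample = X[rng.integers(0, n, size=n)]   # resample rows with replacement
        loadings = first_component(sample)
        # Resolve the sign indeterminacy by matching the original solution.
        if loadings @ reference < 0:
            loadings = -loadings
        boot[b] = loadings
    alpha = (1.0 - level) / 2.0
    lower, upper = np.quantile(boot, [alpha, 1.0 - alpha], axis=0)
    return reference, lower, upper

# Hypothetical data: one underlying component plus noise.
rng = np.random.default_rng(1)
scores = rng.normal(size=(200, 1))
X = scores @ np.array([[0.8, 0.5, 0.3]]) + 0.1 * rng.normal(size=(200, 3))
reference, lower, upper = bootstrap_loading_intervals(X)
```

    For three-way models the matching step is more involved than a sign flip, which is why the abstract treats the handling of transformational nonuniqueness as a separate issue.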

    Sabine Krolak-Schwerdt (Department of Psychology, Saarland University, Germany)
    A Three-Way Multidimensional Scaling Approach to the Analysis of Person Memory and Judgement Structures
    The cognitive organization of person attributes may depend on (1) how coherently the attributes are linked within the stimulus person and (2) how strongly they activate a social stereotype. These two factors determine the number of dimensions in the representation, their salience and their relatedness. To analyze the simultaneous representation of coherence and stereotypicality, a three-way multidimensional scaling model is presented that measures the three dimensional parameters and their change across stimulus conditions. The model essentially constructs an attribute space that is common to all conditions. The model allows for two kinds of distortions which may be specific to certain stimulus conditions: (a) differential weighting of the dimensions of the common space and (b) differential rotations of the space. An experiment investigated the validity of the model. The model showed an excellent statistical fit to the empirical data. Furthermore, the parameters of multidimensional person memory and judgement structures were sensitive to manipulations of coherence and stereotypicality. The results show that (1) both experimental factors reduce the dimensionality of the representation by inducing illusory correlations between judgement dimensions and (2) coherence and stereotypicality complement one another.

    Iven Van Mechelen (Leuven University)
    N-way Hierarchical Classes Models: State of the Art and Ongoing Developments

    Hierarchical classes (HICLAS) models constitute a distinct family of classification models for N-way N-mode data that imply a simultaneous clustering of each of the modes in the data. In this paper, I will present a state of the art of research on the HICLAS family, together with an overview of novel, ongoing developments. The original hierarchical classes model for binary two-way two-mode data as proposed by De Boeck and Rosenberg (1988) will serve as a starting point. Extensions to N-way N-mode data and data that are real-valued rather than binary will be outlined. Ongoing developments will be shown to include various types of restricted models (both in terms of internal constraints and in terms of external covariate information), and different kinds of model expansions. Links with several other simultaneous clustering models will be pointed at.
    De Boeck, P. and Rosenberg, S. (1988): Hierarchical Classes: Model and Data Analysis. Psychometrika, 53, 361-381.
    Ceulemans, E., Van Mechelen, I. and Leenen, I. (in press): Tucker3 Hierarchical Classes Analysis. Psychometrika.
    Van Mechelen, I., Bock, H.H. and De Boeck, P. (in press): Two-mode Clustering Methods: A Structured Overview. Statistical Methods in Medical Research.
    Van Mechelen I., Lombardi, L. and Ceulemans, E. (in press): Hierarchical Classes Modeling of Rating Data. Psychometrika.

    Age K. Smilde et al. (Biosystems Data Analysis, University of Amsterdam and TNO Nutrition and Food Research, The Netherlands)
    Multiset Methods for Longitudinal Metabolomics Data
    Metabolomics is a technique that enables quantification and qualitative analysis of metabolites in biological fluids. There is an increasing awareness in the biology community that time-resolved metabolomics measurements contain important information regarding biological organisms. This is, obviously, related to the dynamic nature of organisms resulting in biorhythms. Such biorhythms can be disturbed by external causes (e.g. drug intake, food intake) or internal causes (e.g. developing diseases). Such disturbances affect the metabolism of organisms and are expected to show up in properly measured longitudinal metabolomics data (e.g. in the urine or blood of the organism). Longitudinal metabolomics analysis can serve several goals. In normality studies, the goal is to establish biorhythms under homeostasis which serve as a reference point to detect future deviating dynamic behavior. Another goal is to detect early biomarkers for developing diseases; this calls for models based upon which biomarker selection can take place. Yet another goal is to model the dynamic response of an organism to external stress which gives insight in the way such an external stress influences the system. All these goals require a different data analysis method. The type of method depends also on the set-up of the metabolomics data set. An overview will be given of different longitudinal modeling strategies for metabolomics data. These methods are based on three-way analysis and (multilevel) simultaneous component analysis. Examples will be given using i) a longitudinal normality study of monkey urine and ii) a longitudinal metabolomic study with urine of guinea pigs developing osteoarthritis during aging. The strengths and limitations of the methods will be illustrated with these example studies.

    Eva Ceulemans (Department of Psychology, Katholieke Universiteit Leuven, Belgium)
    Three-way modeling of individual differences in sequential personality-related processes
    In this paper, we focus on the modeling of a specific type of three-way three-mode binary data that often occurs in personality psychology, the modes of which consist of (1) persons, (2) situations, and (3) mediating cognitive-affective variables (CAVs) as well as behaviors. Underlying such data, personality psychologists typically assume a two-step sequential process
    situation >> CAV >> behavior
    the two steps of which may be characterized in terms of if-then links. It is further hypothesized that these two types of if-then links may differ across persons. An important challenge for personality psychology then consists of retrieving the place and the nature of the key individual differences in the process under study. To meet this challenge, we present a new three-way three-mode model that belongs to the family of Tucker-HICLAS models. The latter is a family of multiway classification models for binary data that constitute the Boolean counterparts of Tucker models for real-valued data. The new Tucker-HICLAS model includes two core arrays that represent the two types of if-then links, as mentioned above, as well as individual differences therein.

    Martijn P.F. Berger (University of Maastricht Department of Methodology and Statistics, The Netherlands)
    Robust Designs for Time-Structured Data
    In health sciences, medicine and social sciences, linear mixed-effects models (LMMs) are often used to analyse longitudinal data. The search for optimal designs for these models is often hampered by two problems. The first problem is that these designs are only locally optimal. The second problem is that an optimal design for one model may not be optimal for other models. In this paper the maximin principle is adopted to handle both problems simultaneously. The maximin criterion is formulated by means of a relative efficiency measure, which gives an indication of how much efficiency is lost when the uncertainty about the models over a prior domain of parameters is taken into account. The procedure is illustrated by means of growth studies. It is shown that for the mixed-effects polynomial models applied to these studies, the maximin designs remain highly efficient for different sets of models and combinations of parameter values.

    Frans Kamphuis (CITO National Institute for Educational Measurement, The Netherlands)
    Methodological Aspects of a Student Monitoring System
    The monitoring system consists of a coherent set of tests for longitudinal assessment of a student's achievement throughout primary education as well as a system for manual or automated registration of student's progress. Primary education consists of eight grades. Usually twice a year an achievement test is taken for subject components of language, mathematics and environmental studies. The results of the successive assessments are converted into a fixed scale for each of the subjects with the help of which a student's progress can be monitored over a number of years. This continuity in the collection of data is of great importance for an early recognition/identification of any problems. In this way the monitoring system complements the impressions that the teacher has of the student on the basis of day-to-day progress assessment of the pupil. Moreover, the nationally standardized scales of the monitoring system make it possible to widen one's view beyond the classroom or the school. Thus results of the students can be compared nationally with those of other children, for example children of the same age group or educational method. Furthermore, by choosing a suitable growth model, it is possible on the basis of student's results to make measurements better and more precise and to predict future results. In the paper special attention is given to the integration of the measurements (item response theory based) into the growth.

    Frank Rijmen, Paul De Boeck, and Han L.J. van der Maas (Onderzoeksgroep HCIV, K.U.Leuven, Belgium and University of Amsterdam, The Netherlands)
    An IRT Model with a Parameter-Driven Process for Change
    An IRT model for binary longitudinal data is presented. The heterogeneity between persons is taken into account by a continuous latent variable, as in common IRT models. Autodependencies are accounted for by assuming within-subject variability with respect to the parameters of the IRT model. More in particular, the parameters of the IRT model are governed by an unobserved or hidden homogeneous Markov process. The model includes the mixture linear logistic test model (Mislevy & Verhelst, 1990), the mixture Rasch model (Rost, 1990), and the Saltus model (Wilson, 1989) as specific instantiations. The model is applied to a longitudinal experiment on discontinuity in conservation acquisition (van der Maas, 1993).
    Mislevy, R.J., and Verhelst, N. (1990). Modeling item responses when different persons employ different solution strategies. Psychometrika, 55, 195-215.
    Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271-282.
    Van der Maas, H. (1993). Catastrophe analysis of stagewise cognitive development: Model, method and applications. Unpublished doctoral dissertation, University of Amsterdam.
    Wilson, M. (1989). Saltus: A psychometric model of discontinuity in cognitive development. Psychological Bulletin, 105, 276-289.

    Rolf Steyer (Friedrich-Schiller-University Jena Institute of Psychology, Germany)
    How to get it all: Average and individual causal effects, and why individuals differ in their effects. Design and Data analysis
    A design and a method of data analysis is presented which yield not only (a) estimates of the average causal effect of a treatment variable on a response variable in the sense of Rubin's approach to causality, but also (b) estimates of the variance of the individual causal effects and (c) of the covariance between pretest and individual causal effects. It is shown how to include variables in the analysis that (d) explain the interindividual differences in the individual causal effects of the treatment variable on the response variable. All this is based on a specific design with random assignment of units to the treatment conditions, assessing a pretest and introducing some additional assumptions which, however, can be tested in the analysis as well. An example will illustrate this new method.

    Paul H. C. Eilers and Martien W. Borgdorff (Department of Medical Statistics, Leiden University Medical Center and KNCV Tuberculosis Foundation, The Hague)
    Non-parametric Log-concave Mixtures
    Finite mixtures of parametric distributions have been studied extensively. Smoothing, i.e. non-parametric estimation, of distributions is also a well-developed field. It seems natural to combine the two, but this is not without problems. Most non-parametric density estimators have too much freedom, leading to identifiability problems for the components of the mixture. An effective solution is to constrain the shapes of the non-parametric distributions by forcing them to be log-concave. This can be implemented easily with penalized likelihood. Increasing a third-order difference penalty pushes the fit gently in the direction of the normal distribution, thereby encouraging log-concaveness. An interesting property is that the mean and variance of the smooth distribution are the same as those of the raw distribution, for any value of the weight of the penalty. This is not the case with other smoothers, like kernels or local likelihood. We can use this log-concave smoother in the familiar "split and fit" EM algorithm for mixture estimation: split the data into two (or more) groups, using membership weights, apply the smoother to each group, and compute new membership weights as relative probabilities of the observations for the next "split". This is repeated until convergence. The algorithm has been applied successfully to data sets with two or three mixture components. An important application is the estimation of the prevalence of tuberculosis from population surveys. We also discuss extensions to (time) series of distributions, where the components are kept the same, but the mixing proportions are allowed to vary (gradually).
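
    The smoothing step alone (not the full mixture algorithm) can be sketched as below. This is a minimal illustration, assuming histogram counts treated as Poisson, a third-order difference penalty on the log of the expected counts, and plain Newton iterations. The function name `smooth_counts`, the penalty weight and the example counts are hypothetical choices, not taken from the talk.

```python
import numpy as np

def smooth_counts(y, lam=10.0, n_iter=100, tol=1e-10):
    """Penalized Poisson likelihood smoothing of histogram counts y."""
    y = np.asarray(y, dtype=float)
    m = y.size
    D = np.diff(np.eye(m), n=3, axis=0)     # third-order difference matrix
    P = lam * (D.T @ D)                     # penalty on eta = log(mu)
    eta = np.log(y + 1.0)                   # starting values
    for _ in range(n_iter):
        mu = np.exp(eta)
        # Newton (IRLS) step for the penalized Poisson log-likelihood.
        eta_new = np.linalg.solve(np.diag(mu) + P, y - mu + mu * eta)
        converged = np.max(np.abs(eta_new - eta)) < tol
        eta = eta_new
        if converged:
            break
    return np.exp(eta)

# Hypothetical histogram on equally spaced bins x.
x = np.arange(12, dtype=float)
y = np.array([1, 2, 6, 9, 15, 18, 14, 11, 6, 3, 2, 1], dtype=float)
mu = smooth_counts(y)

# Moments of the raw and the smoothed histogram, for comparison.
mean_y = (x * y).sum() / y.sum()
mean_mu = (x * mu).sum() / mu.sum()
var_y = ((x - mean_y) ** 2 * y).sum() / y.sum()
var_mu = ((x - mean_mu) ** 2 * mu).sum() / mu.sum()
```

    At convergence the fit satisfies y - mu = lam * D'D * eta, and third-order differences annihilate constant, linear and quadratic sequences, so the total, mean and variance of the smoothed counts match those of the raw counts, in line with the property mentioned in the abstract.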



    VOC Autumn Meeting 2003

    Friday 21 November 2003, Leiden

    The programme is highly varied; its theme is Item Response Theory models and their applications to topics such as quality of life and mobility.

    Venue: Leiden University Medical Center, lecture hall, Physiology building

    Programme

    • 10:30-11:00 Welcome and coffee
    • 11:00-11:35 Stef van Buuren (TNO-PG) Improving comparability with Response Conversion
    • 11:35-12:10 Frank Rijmen (Leuven University) Mixed models in Item Response Theory
    • 12:10-12:45 Rebecca Holman (Academic Medical Centre, Amsterdam) The AMC Linear Disability Score project: using item response theory to construct and calibrate an item bank to measure the ability to perform activities of daily life
    • 12:45-14:00 Lunch at the Posthof
    • 14:00-14:35 Martijn Berger (Maastricht University) Optimal design in educational testing: a review
    • 14:35-15:10 Andries van der Ark (Tilburg University) Nonparametric item response theory
    • 15:10-15:45 Tea
    • 15:45-16:45 Norman Verhelst (Cito) Item Response Theory and Multiple-Choice Questions
    • 16:45 Drinks

    Admission is free for anyone interested. Please register before Monday 17 November with Marieke Timmerman, email: M.E.Timmerman@ppsw.rug.nl. When registering, please indicate whether you wish to join the lunch (9 euro).

    Abstracts

    Stef van Buuren (TNO Prevention and Health, TNO-PG)

    Improving comparability with Response Conversion

    The European Union will soon have 25 member states, each with its own information systems. Commissioned by the European Commission, TNO-PG has developed a method for improving the comparability of information from different member states. The basic idea is that similar information can be translated to a common scale, so that countries can be compared on that shared scale. The process has two steps. Step 1 is the construction of a conversion key; step 2 is the actual conversion to the common scale. A decisive advantage of the method is that it enables new comparative analyses based on existing data. The method is a more or less direct application of Item Response Theory. The structure of the data, however, brings some specific complications that do not arise, or arise to a lesser degree, in more classical applications of IRT. The presentation gives examples of the method and addresses questions that are still open.

    Stef van Buuren is Head of Statistics at TNO Prevention and Health. His work includes the growth of Dutch children, multiple imputation of incomplete data, comparability, and simulation systems, and he publishes on these topics in both the applied and the statistical literature.


    Frank Rijmen (Leuven University)

    Mixed models in Item Response Theory

    Test data are clustered: Each tested person corresponds to a cluster, and the responses of a person on the individual items of the test correspond to the observations within a cluster. Typically, the observations within a cluster tend to be more homogeneous than the observations stemming from different clusters, meaning that the observations within a cluster are not statistically independent. Mixed models are a collection of statistical tools that are well suited for analyzing clustered data. The heterogeneity between clusters can be taken into account by assuming that (some of) the parameters of the model follow a random distribution over the population of clusters. Hence, (some of) the parameters of the model are random variables, and the model is a random effects or mixed model. The random effects represent the cluster-specific effects. Mixed models and related methods were first developed in the context of analysis of variance and regression analysis, leading to the linear mixed model. Other commonly used terms are multilevel models, hierarchical models, and random coefficient models. More recently, generalized linear and nonlinear mixed models were also developed.

    Most parametric IRT models can be conceptualized as generalized linear and nonlinear mixed models. For example, the linear logistic test model (Fischer, 1973) is a logistic regression model with a random intercept. There are four important assets of this conceptualization. First, the mixed model framework for IRT models relates IRT to a broad statistical literature. Second, applying a common framework to different IRT models can help in our understanding of their differences and commonalities. Third, standard IRT models can readily be adapted and extended, so that researchers can build their own model, customized to a specific scientific question or data set. Finally, existing and newly formulated models can be estimated using software for generalized linear and nonlinear mixed models.
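
    The link between the linear logistic test model and a random-intercept logistic regression can be made concrete with a small sketch. It assumes the common parameterization in which the logit of a correct response is the person effect minus a weighted sum of basic parameters. The design matrix `Q`, the parameter values and the function name are hypothetical and serve only as illustration.

```python
import numpy as np

def lltm_probabilities(theta, Q, eta):
    """P(correct) for each person (rows) and item (columns).

    theta : person effects (the random intercepts), shape (n_persons,)
    Q     : item-by-property design matrix, shape (n_items, n_properties)
    eta   : fixed effects of the item properties, shape (n_properties,)
    """
    beta = Q @ eta                            # item difficulties as linear combinations
    logits = theta[:, None] - beta[None, :]   # random intercept plus fixed part
    return 1.0 / (1.0 + np.exp(-logits))

# Hypothetical example: two item properties, two items, normally distributed persons.
Q = np.array([[1.0, 0.0],
              [1.0, 1.0]])
eta = np.array([0.5, 0.5])
rng = np.random.default_rng(0)
theta = rng.normal(loc=0.0, scale=1.0, size=5)   # random intercepts
p = lltm_probabilities(theta, Q, eta)

# Two fixed persons, to check the arithmetic by hand.
p_check = lltm_probabilities(np.array([0.0, 1.0]), Q, eta)
```

    In mixed-model terms, `theta` is the cluster-specific random effect, and the columns of `Q` play the role of ordinary covariates with fixed effects `eta`.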

    Frank Rijmen obtained his PhD in psychometrics at the K.U. Leuven in 2002. Now he is a postdoctoral researcher at the same university. His main research interest is in the field of item response theory, with special attention to the relation between parametric IRT models and generalized linear and non-linear mixed models.


    Rebecca Holman (Academic Medical Center, Amsterdam)

    The AMC Linear Disability Score project: using item response theory to construct and calibrate an item bank to measure the ability to perform activities of daily life

    An important aspect of quality of life is the ‘disability’ status of patients. This is often described in terms of their ability to carry out ‘activities of daily life’ and measured using questionnaires, which grade each patient on whether they are able to perform certain activities. A questionnaire, which contains many items, can cost patients, clinicians and researchers an excessive amount of time to complete. In the framework of item response theory (IRT) and in conjunction with an item bank, it is possible to use computerized adaptive testing procedures to assess functional status by presenting each individual patient with a smaller selection of items than is possible using traditional methods. This has awakened the interest of medical and epidemiological researchers in the use of IRT as a tool to analyse this type of questionnaire. The AMC Linear Disability Score project aims to develop an item bank containing items, designed to quantify the ability of chronically ill patients to perform ‘activities of daily life’. The item bank currently contains approximately 200 items, each describing an activity of daily life connected with self-care, mobility and household management. This paper will describe the stages involved in constructing and calibrating this item bank and the statistical and methodological challenges encountered in the process. In addition, an insight into the applications and uses of the item bank will be given.
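
    The adaptive selection idea can be sketched as follows. This is only an illustration, assuming a Rasch model with known item difficulties and selection by maximum Fisher information. The abstract does not state which IRT model or selection rule the project uses, and the names `rasch_information` and `select_next_item` are hypothetical.

```python
import numpy as np

def rasch_information(theta, difficulties):
    """Fisher information of each item at ability theta under the Rasch model."""
    p = 1.0 / (1.0 + np.exp(-(theta - difficulties)))
    return p * (1.0 - p)

def select_next_item(theta, difficulties, administered):
    """Index of the most informative item that has not been presented yet."""
    difficulties = np.asarray(difficulties, dtype=float)
    info = rasch_information(theta, difficulties)
    available = np.ones(difficulties.size, dtype=bool)
    for i in administered:
        available[i] = False                 # never present an item twice
    info = np.where(available, info, -np.inf)
    return int(np.argmax(info))

# Hypothetical item bank of three items and a current ability estimate of 0.3.
bank = [-1.0, 0.2, 2.0]
first = select_next_item(0.3, bank, administered=set())
second = select_next_item(0.3, bank, administered={first})
```

    Under the Rasch model the information is largest for the item whose difficulty is closest to the current ability estimate, so each patient only needs to answer items near his or her own level.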

    Rebecca Holman is employed as a statistician by the Department of Clinical Epidemiology and Biostatistics at the Academic Medical Center, Amsterdam, The Netherlands. Her work concentrates on the statistical and methodological challenges encountered when item response theory techniques are used to analyse data resulting from questionnaires to quantify health status, often in terms of the ability to perform activities of everyday life.


    Martijn Berger (Maastricht University)

    Optimal design in educational testing: a review

    Modern testing in education and psychology is mostly based on Item Response Theory (IRT), and several different IRT models have been proposed to analyse the test data. As in any research situation where data are obtained, the costs of collecting test data may be enormous. Both the required sample size for efficient parameter estimation and the costs of producing test items can be reduced by using optimal design techniques. Optimal design theory was first applied to design issues in testing about 30 years ago.

    Basically there are two distinct design issues in testing. The first issue is concerned with the calibration of pre-production test items. Item calibration takes place by administering test items to a selected sample of test takers, simultaneously or sequentially. Since large samples are usually needed to obtain efficient item parameter estimators, it is worthwhile to try to select an optimal sample of test takers, which is preferably much smaller in size. The second design issue is the problem of how to pick the test items for a given test taker or a sample of test takers for efficient trait estimation. Both issues have been tackled by methods originating from optimal design theory.

    In this paper a review of the optimal design methods and procedures will be presented. First, IRT models for both dichotomous and polytomous items will be briefly described. Then the optimal design problem for the two design issues (item calibration and trait estimation) will be described together with the problems arising from practical and technical constraints. Optimal design methods have been applied to both computerized adaptive testing and paper-and-pencil testing in classroom settings. A bibliography will be provided.

    Martijn P.F. Berger has been involved in optimal design research, not only for testing with IRT models, but also for generalized linear mixed models (GLMM), with emphasis on longitudinal designs. Recent research focuses on maximin procedures, which have been applied to the nominal response model, to the GLMM models and to the robustness of design problem.


    Andries van der Ark (Tilburg University)

    Nonparametric item response theory

    The aim of nonparametric item response theory (NIRT) is to order respondents on a latent variable (called the latent trait) using the ordinal scores on several manifest variables (called items). In this presentation, I will first explain the basic ideas of NIRT for dichotomous item scores. Second, I will present an application where NIRT is used for the construction of a psychological test measuring crying behavior. In this test, the unweighted total score is used as an ordinal estimate of the latent trait. The analyses were conducted using the software program MSP. It is shown how quantitative estimates of the latent trait may be obtained in subsequent analyses. Third, I will discuss some open problems that are encountered when NIRT is extended from dichotomous items to polytomous items.

    Andries van der Ark is an assistant professor at the Department of Methodology and Statistics of the Faculty of Social and Behavioral Sciences at Tilburg University. His research interest is the modelling of test and questionnaire data. He is currently working on nonparametric item response theory and the analysis of missing data.


    Norman Verhelst (Cito)

    Item Response Theory and Multiple-Choice Questions

    One of the main controversies between American (U.S.) and European (and Australian) students of IRT is the question whether a simple model like the Rasch model is suited to model responses to multiple-choice questions. The mainstream attitude in the U.S. is to reject this model with the argument that it cannot handle correct answers that come about by guessing. It will be argued that it is quite possible to apply the Rasch model to multiple-choice items, by considering the items as a choice situation with a finite number of alternatives, by applying Luce's choice axiom to these situations, and at the same time accounting for individual differences in the scale values resulting from this choice axiom. Some implications for practical applications will also be discussed.
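
    A small numerical sketch may help to see how a choice rule of the Luce type can lead to a Rasch-type probability. It assumes, purely for illustration, that the correct alternative has scale value exp(theta) for a person with ability theta and that the scale values of the distractors sum to exp(beta). This parameterization and the function name are assumptions, not necessarily the formulation used in the talk.

```python
import numpy as np

def luce_choice_probabilities(scale_values):
    """Luce's choice rule: each alternative is chosen in proportion to its scale value."""
    v = np.asarray(scale_values, dtype=float)
    return v / v.sum()

# Three alternatives with scale values 1, 1 and 2.
probs = luce_choice_probabilities([1.0, 1.0, 2.0])

# Hypothetical person and item: ability theta, item parameter beta.
theta, beta = 1.0, 0.4
p_correct = luce_choice_probabilities([np.exp(theta), np.exp(beta)])[0]

# The same probability written in the usual Rasch form.
p_rasch = 1.0 / (1.0 + np.exp(-(theta - beta)))
```

    Under these assumptions the probability of choosing the correct alternative equals the Rasch expression exp(theta - beta) / (1 + exp(theta - beta)), with the individual differences entering through the person-specific scale value.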

    Norman Verhelst was born in 1946. He studied Psychology at the Catholic University of Leuven (Belgium) with emphasis on mathematical psychology and psychometrics. Further career: teaching Statistics, Methodology and Psychometrics at the universities of Leuven, Nijmegen, Utrecht and Twente. Since 1985 he has worked in research and consultancy in Psychometrics at the National Institute for Educational Measurement (Cito), Arnhem, The Netherlands. Main interest: Item Response Theory as applied in National Assessment Programs and international comparative studies (PISA).


    VOC Spring Meeting 2003

    Algemene Rekenkamer (Netherlands Court of Audit), The Hague

    Friday 28 March 2003

    The programme of the spring meeting again looks attractive. This time we have chosen a variety of topics centred on the two pillars of our society, ordination and classification. The presentation by keynote speaker Tom Snijders from Groningen combines these two topics; it will be about social network analysis. This time we are the guests of the Algemene Rekenkamer in The Hague, with Berrie Zielman as local organizer. It promises to be a fine and interesting day.

    • 10:00 - 10:30 Welcome
    • 10:30 - 11:05 Uzay Kaymak (Erasmus Universiteit Rotterdam)
    • 11:05 - 11:40 Mark de Rooij (Universiteit Leiden)
    • 11:40 - 12:15 Patrick Groenen (Erasmus Universiteit Rotterdam) and Jeroen Poblome (Katholieke Universiteit Leuven)
    • 12:15 - 13:30 Lunch
    • 13:30 - 14:05 Kaatje Bollaerts, Iven van Mechelen (Katholieke Universiteit Leuven) and Paul Eilers (Universiteit Leiden)
    • 14:05 - 14:40 Anja Struijf (Universiteit Antwerpen)
    • 14:40 - 15:10 Tea
    • 15:10 - 16:10 KEYNOTE: Tom Snijders (Rijksuniversiteit Groningen)
    • 16:10 - 16:30 Members' meeting
    • 16:30 - later Drinks

    Location: Algemene Rekenkamer, Lange Voorhout 8, The Hague. The main entrance is not visible from the Lange Voorhout and can be reached through the gateway between house numbers 4 and 6 (to the right of the Kloosterkerk).

    Route information: http://www.rekenkamer.nl/9282000/d/routebeschrijving.pdf

    Register with Marieke Timmerman, email: M.E.Timmerman@ppsw.rug.nl

    Abstracts and CVs

    Uzay Kaymak (Erasmus Universiteit Rotterdam)

    Discovering structure in data sets by fuzzy clustering

    Fuzzy models have gained in popularity in various fields such as control engineering, decision-making and data mining. One of the important advantages of fuzzy models is that they combine numerical accuracy of universal function approximators with transparency in the form of linguistic rules. Hence, fuzzy models take an intermediate place between numerical and symbolic models. A method that is used often for obtaining fuzzy models is fuzzy clustering. Fuzzy clustering algorithms are unsupervised techniques that partition a data set into overlapping groups based on similarity within the groups and dissimilarity amongst the groups. They can be used to discover latent structure in a data set. In this contribution, we explain the basics of fuzzy clustering with an emphasis on objective-function-based fuzzy clustering algorithms such as the fuzzy c-means algorithm. We discuss a number of issues regarding the selection of fuzzy clustering parameters and illustrate how the structure discovered through fuzzy clustering can be translated into a fuzzy model.
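
    The fuzzy c-means algorithm mentioned above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distances, a fuzziness exponent m = 2 and random initial memberships. The function name, the default parameter values and the toy data are hypothetical choices and do not come from the talk.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Alternate between updating cluster centers and fuzzy memberships."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)            # memberships sum to 1 per object
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]      # weighted means
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)           # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)      # membership update
        converged = np.max(np.abs(U_new - U)) < tol
        U = U_new
        if converged:
            break
    return centers, U

# Hypothetical toy data: two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
centers, U = fuzzy_c_means(X, n_clusters=2)
labels = U.argmax(axis=1)                        # hardened memberships
```

    Each row of `U` holds the degrees to which an object belongs to the clusters. Larger values of `m` give softer, more overlapping groups, which is one of the parameter choices the abstract refers to.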

    Uzay Kaymak is an assistant professor at the Department of Computer Science of the Faculty of Economics of Erasmus University Rotterdam, the Netherlands. He obtained the degree of Chartered Designer in Information Technology and his Ph.D. from Delft University of Technology, the Netherlands, in 1995 and 1998, respectively. Between 1997 and 2000 he worked as a reservoir engineer at Shell International Exploration and Production in Rijswijk, the Netherlands. His research interests include fuzzy decision making, data mining for marketing and finance, and intelligent agents for financial modeling. He is a co-author of the recently published book "Fuzzy Decision Making in Modeling and Control" as well as of numerous papers on fuzzy systems and their applications. He is also an associate editor of IEEE Transactions on Fuzzy Systems. Email: kaymak@few.eur.nl or u.kaymak@ieee.org.

    Mark de Rooij (Universiteit Leiden)

    Statistical Modeling using Euclidean Distances

    Generalized linear models have become a useful statistical framework for multivariate analysis. In the present paper we will assume categorical predictors and show how Euclidean distances can be used to obtain a graphical representation of the solution. Especially in the case of interacting predictors, the distance representation is useful and dimensionality restrictions can be imposed to get higher power and simpler interpretation. With common Euclidean distances only bivariate interactions can be represented. Recently triadic distances have been proposed and we will discuss the way these distances represent trivariate interactions.

    Mark de Rooij is an assistant professor (Universitair Docent) in the section Methods and Techniques of Psychological Research at Leiden University. He obtained his PhD in 2001 with the thesis "Distance models for transition frequency data", supervised by Willem Heiser. His research interests lie in multidimensional scaling and unfolding and in the use of such techniques in statistical models. He also studies extensions of distance models to distances among three points and the properties of such triadic distances.

    Patrick Groenen (Erasmus Universiteit Rotterdam) and Jeroen Poblome (Katholieke Universiteit Leuven)

    Constrained correspondence analysis for seriation in archaeology

    Correspondence analysis is a well-known technique for seriation of archaeological artefactual assemblages. One problem with the seriation solution is that no explicit time frames are obtained, only a relative ordering. However, in some cases additional information is available allowing absolute dating of some of the deposits. Such explicit dating information may be obtained from associated categories of finds, such as coin series. Additional information may also be available that logically restricts the order of the seriation. For example, in the case of a superposed stratigraphical sequence, the lower stratum is associated with events that took place earlier than those of the upper layer, and consequently this ordering should be replicated in the seriation. In this paper, we propose a constrained form of correspondence analysis that takes such restrictions into account. Using these constraints we are able to assign explicit dates to a seriated solution. This new method of seriation is applied to a series of ceramic assemblages consisting of the locally produced tableware from Sagalassos (SW Turkey). These tableware assemblages have already been seriated and dated independently via empirical archaeological techniques (Poblome, 1999). To establish the stability of the solution, we use the bootstrap method (Efron and Tibshirani, 1993).

    Patrick Groenen is full professor in statistics at the Econometric Institute of the Erasmus University Rotterdam. Currently, he is president of the VOC. He has written several articles in the area of multivariate analysis, multidimensional scaling, global optimization, clustering, and majorization. In 1997, the textbook 'Modern Multidimensional Scaling' appeared, which he co-authored with Ingwer Borg. He is associate editor of Statistica Neerlandica and Computational Statistics and Data Analysis.

    Kaatje Bollaerts, Iven Van Mechelen (Katholieke Universiteit Leuven), Paul Eilers (Leids Universitair Medisch Centrum)

    Constrained P-spline regression

    In various research areas, including psychology, the relationship between predictor and criterion variables is often assumed to be of a particular non-parametric functional form, such as a monotone, single-peaked, or stepwise relation. In this talk, we will present a method to check such assumptions. This method is essentially non-parametric regression with constraints that reflect the assumed non-parametric form. As such, it constitutes a golden mean between exploratory and confirmatory data-analytic approaches. In particular, we will discuss P-spline regression with additional asymmetric penalties enforcing monotonicity constraints. The latter will be illustrated with data from research on the cognitive development of children.
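    The idea of an asymmetric penalty can be sketched as follows. This is a deliberately simplified variant, not the authors' method: the B-spline basis of a P-spline is replaced by an identity basis (a Whittaker-type smoother), so the fitted values themselves are the coefficients. The function name `monotone_smooth`, the penalty weights `lam` and `kappa`, and the simulated data are assumptions made for illustration.

```python
import numpy as np

def monotone_smooth(y, lam=10.0, kappa=1e6, n_iter=50):
    """Smooth y with a second-difference roughness penalty (weight lam) plus
    an asymmetric penalty (weight kappa) that acts only on first differences
    that are negative, pushing the fit towards a non-decreasing curve."""
    n = len(y)
    D1 = np.diff(np.eye(n), n=1, axis=0)        # first-difference matrix
    D2 = np.diff(np.eye(n), n=2, axis=0)        # second-difference matrix
    base = np.eye(n) + lam * D2.T @ D2
    v = np.zeros(n - 1)                          # 1 where monotonicity is violated
    z = np.linalg.solve(base, y)                 # unconstrained smooth
    for _ in range(n_iter):
        v_new = (D1 @ z < 0).astype(float)
        if np.array_equal(v_new, v):             # violation pattern is stable
            break
        v = v_new
        z = np.linalg.solve(base + kappa * D1.T @ (v[:, None] * D1), y)
    return z

# Hypothetical data: a noisy increasing curve.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 60)
y = 1.0 / (1.0 + np.exp(-10.0 * (x - 0.5))) + rng.normal(0.0, 0.1, x.size)
z = monotone_smooth(y)
print(np.round(np.diff(z).min(), 6))
```

    Because the penalty is switched on only where the current fit decreases, the weights have to be recomputed after each solve; the loop stops once the set of violating differences no longer changes.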

    Kaatje Bollaerts obtained the degree of Master in Psychology at the University of Leuven in 2001. At present she is preparing a PhD thesis at the same university. Her main research interests include visualization techniques and methods to capture interactions.

    Anja Struyf (Universiteit Antwerpen)

    Visualizing clusters using data depth

    Bivariate clusters are easy to visualize using a scatter plot. Pison et al. (1999) proposed a program, called clusplot, that also visualizes high-dimensional clusters in a two-dimensional graph. Before plotting, they reduce the dimension of the data by means of principal component analysis (object-by-variables data) or multidimensional scaling (dissimilarity data). Clusters are then separated by means of ellipses (or ellipsoids in 3D). Instead of using ellipses, one could use other bivariate plots that display more detailed characteristics of the individual clusters. One possibility is the bagplot (Rousseeuw et al. 1999), a bivariate generalization of the well-known boxplot, which is based on data depth (Tukey 1975). The Tukey depth, a multivariate rank, gives points located near the center of the data cloud a high rank, while points outside the convex hull of the data have depth 0.
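    The behaviour of the Tukey depth described above can be checked with a small sketch. The code below approximates the halfspace depth of a point in the plane by scanning a finite grid of directions, so it gives an upper bound on the exact depth rather than the exact value; the function name `tukey_depth_2d`, the number of directions, and the simulated data are assumptions for this example and are not taken from the bagplot software.

```python
import numpy as np

def tukey_depth_2d(point, data, n_dirs=360):
    """Approximate halfspace (Tukey) depth of `point` relative to `data`:
    the smallest number of data points in a closed halfplane whose boundary
    passes through `point`, minimised over n_dirs equally spaced directions."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_dirs, endpoint=False)
    dirs = np.column_stack([np.cos(angles), np.sin(angles)])
    proj = (np.asarray(data) - np.asarray(point)) @ dirs.T   # shape (n, n_dirs)
    counts = (proj >= 0).sum(axis=0)      # points on the positive side, per direction
    return int(counts.min())

# Hypothetical data cloud centred at the origin.
rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, (200, 2))
depth_center = tukey_depth_2d([0.0, 0.0], data)
depth_outside = tukey_depth_2d([10.0, 10.0], data)
print(depth_center, depth_outside)
```

    A point far outside the cloud can be separated from all observations by a single line, which is why its depth is 0, while a central point has roughly half of the observations on either side of any line through it.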

    References:

    G. Pison, A. Struyf, P.J. Rousseeuw (1999), Displaying a clustering with CLUSPLOT, Computational Statistics and Data Analysis, 30, 381-392.

    P.J. Rousseeuw, I. Ruts, J.W. Tukey (1999), The bagplot: a bivariate boxplot, The American Statistician, 53, 382-387.

    J.W. Tukey (1975), Mathematics and the picturing of data, Proceedings of the International Congress of Mathematicians, 2, 523-531, Vancouver.

    Anja Struyf obtained her PhD in statistics at the University of Antwerp in 2000. She is now a postdoctoral fellow of the F.W.O.-Vlaanderen and guest professor at the University of Antwerp. Her research interests are mainly in the field of robust statistics: data depth, estimators of skewness and tail weight, and cluster analysis. Special attention is given to the development of fast algorithms for the newly developed methods.

    Tom A.B. Snijders (University of Groningen)

    Latent structure models in social network analysis

    In social network studies, but also in various other fields, directed graphs are a useful structure to represent data about relations between units (or vertices, or nodes), also called network data. One type of model for network data is a kind of latent class model, in which there is a partition of nodes into equivalence classes, which determine the patterns of links between the units in these classes. In the social networks literature, this is called structural equivalence. From a data-analytic point of view, the case where the classes are latent and have to be inferred from the observed pattern of linkages is especially interesting. In the probabilistic version of such models ("stochastic equivalence"), it is assumed that the existence of links (or more generally, link patterns) between each pair of units is random, and that the probability distribution of the link patterns between two vertices depends only on the classes to which they belong. This leads to approximate block patterns in the adjacency matrix, hence the name blockmodeling. Another type of model is a latent distance model, where the probability of a link is a decreasing function of the distance between two units, but the distance is not observed directly. Euclidean distances were studied by Hoff, Raftery, and Handcock; ultrametric distances were studied by Schweinberger and Snijders. Statistical approaches to parameter estimation in such latent distance models are discussed. These approaches are various forms of Markov Chain Monte Carlo algorithms, including Gibbs sampling, simulated annealing, and MCMC maximum likelihood estimation.
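    The notion of stochastic equivalence can be made concrete with a small simulation. The sketch below generates a directed graph in which the probability of a link depends only on the classes of sender and receiver, and then recovers the block densities from the adjacency matrix when the classes are known. It does not implement the estimation of latent classes discussed in the talk; the function names, the class sizes, and the probability matrix `P` are assumptions for illustration.

```python
import numpy as np

def sample_blockmodel(classes, P, seed=0):
    """Sample a directed adjacency matrix in which the probability of a link
    i -> j is P[class of i, class of j]. Self-links are excluded."""
    rng = np.random.default_rng(seed)
    classes = np.asarray(classes)
    P = np.asarray(P, dtype=float)
    probs = P[classes[:, None], classes[None, :]]   # (n, n) link probabilities
    A = (rng.random(probs.shape) < probs).astype(int)
    np.fill_diagonal(A, 0)
    return A

def block_densities(A, classes, k):
    """Observed proportion of links within each (sender class, receiver class) block."""
    classes = np.asarray(classes)
    dens = np.zeros((k, k))
    for r in range(k):
        for s in range(k):
            block = A[np.ix_(classes == r, classes == s)]
            # diagonal cells are not possible links, so drop them when r == s
            n_pairs = block.size - (block.shape[0] if r == s else 0)
            dens[r, s] = block.sum() / n_pairs
    return dens

# Hypothetical example: two classes of 30 nodes each.
classes = np.repeat([0, 1], 30)
P = np.array([[0.8, 0.1],
              [0.2, 0.7]])
A = sample_blockmodel(classes, P)
dens = block_densities(A, classes, k=2)
print(np.round(dens, 2))
```

    When the rows and columns of `A` are sorted by class, the block pattern in the adjacency matrix becomes visible; with latent classes, the partition itself has to be estimated, for instance with the MCMC approaches named in the abstract.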

    Tom A.B. Snijders is professor of Methodology and Statistics at the Department of Sociology of the University of Groningen. He is also scientific director of the ICS, a research and graduate school in sociology in which researchers of the universities of Groningen, Utrecht, and Nijmegen participate. His research is in the domain of statistical modeling in the social sciences, with special interest in social network analysis and multilevel modeling, and in statistical models that (to some extent) reflect substantive theory. He is associate editor of Journal of Educational and Behavioral Statistics, Psychometrika, and Journal of Social Structure.

    VOC Autumn Meeting 2002

    Ordination and classification in biomedical applications

    Erasmus MC, Rotterdam

    Friday 8 November 2002

    • 10.00 - 10.30 Welcome
    • 10.30 - 11.05 Steffen Fieuws (Katholieke Universiteit Leuven)
    • 11.05 - 11.40 Sabine Van Huffel (Katholieke Universiteit Leuven)
    • 11.40 - 12.15 Caspar Looman (Erasmus MC, Rotterdam)
    • 12.15 - 13.45 Lunch
    • 13.45 - 14.20 Elise Dusseldorp (Leiden University)
    • 14.20 - 14.55 France Portrait (Vrije Universiteit, Amsterdam)
    • 14.55 - 15.30 Merel van Dijk (Erasmus MC, Rotterdam)
    • 15.30 - 16.00 Tea
    • 16.00 - 17.00 Lawrence Hubert (University of Illinois, Champaign)
    • 17.00 - later Drinks

    Location: Erasmus MC, Rotterdam. Route information

    Register with Paul Eilers

    Abstracts and CVs

    Steffen Fieuws (Biostatistical Centre, Katholieke Universiteit Leuven)

    Classification of Longitudinal Profiles using Nonlinear Mixed-Effects Models

    In different biomedical situations, markers are needed to detect the onset of a specific disease as soon as possible. Often, a series of marker measurements turns out to be a better screening tool than a single measurement. This fact urges the development of classification strategies using longitudinal information. Results of classical discriminant analysis can be applied to classify new subjects in diagnostic groups (e.g., disease, no disease) whenever the data have a balanced structure. Extensions have been proposed for unbalanced data. These extensions use linear mixed effects models or linearized versions of nonlinear models to describe the longitudinal profiles in each diagnostic group. Using these group-specific descriptions, posterior probabilities of group membership are calculated to classify individual profiles. We will present an extension of the proposed strategy by using linear as well as nonlinear mixed effects models for the description of group-specific evolutions. The extension will be illustrated using 342 PSA profiles collected in the Baltimore Longitudinal Study of Aging. It will be shown that the approach leads to a faster detection of the onset of prostate cancer.

    With a background in Experimental Psychology and an MSc in Statistics, Steffen Fieuws is currently working at the Biostatistical Centre of the K.U.Leuven. Since 1999, he has also acted as a statistical consultant for the Department of Medicine. His areas of interest include the development of classification strategies using longitudinal data and the modelling of multivariate longitudinal responses. The models involved contain linear as well as nonlinear mixed effects.

    Sabine Van Huffel* (Department of Electrical Engineering, Division ESAT-SCD, Katholieke Universiteit Leuven)

    Linear versus nonlinear classifiers for preoperative discrimination between malignant and benign ovarian tumors

    Ovarian malignancy has the highest mortality rate among gynaecologic cancers. An accurate preoperative discrimination between malignant and benign ovarian tumors is critical to obtain the most effective treatment and best advice. In this study, we develop and evaluate several classifiers to preoperatively predict malignancy of ovarian tumors: linear ones such as logistic regression models, as well as nonlinear ones such as multilayer perceptrons (MLP) and least squares support vector machines (LS-SVMs). The optimal input variables are identified via an exploratory multivariate data analysis, followed by a stepwise and forward selection procedure which aims to maximize the predictive ability of the models. The performance of the models is assessed via Receiver Operating Characteristic (ROC) curve analysis. The experimental results suggest that both MLPs and LS-SVMs have the potential to obtain a reliable preoperative classification of the ovarian tumors, and that their performance is comparable.

    * Joint work with C. Lu, T. Van Gestel, J.A.K. Suykens (ESAT-SCD) and I. Vergote, D. Timmerman (U.Z. Leuven).

    Sabine Van Huffel is full professor at the Katholieke Universiteit Leuven, Belgium. Her research interests are in numerical linear algebra, errors-in-variables regression, signal processing, system identification, data mining, pattern recognition, and (non)linear modelling (using neural networks, Bayesian networks and support vector machines). Special attention is given to the design of reliable algorithms and their practical evaluation in medical diagnostics and classification (e.g., optical and magnetic resonance spectroscopy, pre-operative classification of ovarian tumours). In these areas, she has authored 2 books, 50 papers and 100 conference contributions.

    Caspar WN Looman* (Erasmus MC)

    The level and time course of disability: trajectories of disability in adult and elderly persons

    Purpose of study. This study aims (1) to identify common patterns ('trajectories') in the level and time course of disability, (2) to determine the relative frequency of each trajectory and (3) to assess the relationship of these trajectories with predictors at the start of the observation period. Design and methods. Our data came from a population-based longitudinal study in 15-74 year old Dutch persons: over 2800 persons participated who ideally contributed information on six timepoints in seven years. We used cluster analysis to group persons with similar levels and time course of disability (i.e. direction, linearity of change and variability of disability) into common trajectories. Multinomial regression was used to assess the relationship of the trajectories with several predictors, such as presence of disease at the start of follow-up or risk factors such as smoking or alcohol use at baseline. Special efforts were made to handle response bias due to attrition and to correct for the stratification in the sampling design. Results. We identified nine trajectories of disability and one trajectory including all deaths. 74% of the population between 14 and 74 years at baseline is permanently nondisabled. The size of the trajectories with disability varies from 10% (permanently mildly disabled) to 0.5% (severely disabled with large increase in disability). Imputation for respondents who missed one or two questionnaires often led to a consistent allocation in spite of the missing data; when more questionnaires were missing the allocation often varied between imputation replicas. The logistic regression showed a higher risk for trajectories with disability for alcohol abstainers as compared to heavy alcohol users. Implications. Disability is a dynamic process, showing important differences in the level and time course between individuals. The implications of this diversity for conclusions of prior studies based on two measurements and for the burden of disability and the need for care are discussed. Information of respondents with only a few missing time points should be used in the analysis by some process of imputation.

    * Joint work with Wilma J. Nusselder (Erasmus MC)

    Caspar Looman (a "cultuurtechnicus", i.e. a land and water management engineer by training) is a statistician at the Department of Public Health (Maatschappelijke Gezondheidszorg) of Erasmus MC. He enjoys helping researchers find adequate analysis methods for their problems. His interests lie mainly in factor and cluster analysis, loglinear analysis, imputation, stratification, and regression to the mean.

    Elise Dusseldorp (Data Theory Group, Leiden University)

    Which patients benefit more from which type of treatment? A new way to assess treatment covariate interaction

    A new analysis strategy is proposed, called the regression trunk approach (RTA). It combines two existing analysis methods: regression trees (Breiman, Friedman, Olshen & Stone, 1984) and multiple linear regression analysis (also, see Dusseldorp & Meulman, 2001). RTA traces interaction effects between predictors, in the regression of one continuous response variable on multiple predictor variables. In this paper, RTA is used to assess treatment covariate interactions, in the regression of one continuous variable on a treatment variable and multiple covariates. Treatment Covariate Interaction (TCI) is defined as the interaction between a treatment variable (categorical, with K categories) and a covariate (continuous or categorical). The presence of a TCI indicates that the effectiveness of one treatment (e.g., cognitive therapy) differs for subjects with different values on the covariate (e.g., locus of control), compared to another treatment (e.g., antidepressants). RTA encompasses three phases. In the first phase, a main effects model is estimated by multiple regression. In the second phase, a small regression tree (called “a regression trunk”) is fitted on the residuals of the main effects model. This is done in a special way. The first split is forced on the treatment variable. The remaining splits are free. The resulting regression trunk represents one or more TCIs by so-called threshold interactions (which differ from the commonly used cross-product interactions). In the third phase, the regression trunk is converted into contrast variables and these are added as a second block to the multiple regression main effects model. The cross-validated variance accounted for (C-VAF) of the model including the contrast variables is compared to the C-VAF of the main effects model. Both RTA and the classical method of forward stepwise regression were applied to a real data example from patients with panic disorder. (The data were obtained from Bakker et al., 1999.) The final RTA model had a higher cross-validated variance accounted for (29.8%) than the classical model (12.5%). The results indicated that the “normal” cross-product to represent an interaction was not appropriate to these data. The threshold interaction, discovered by RTA, revealed that patients with a medium internal control orientation benefit more from cognitive therapy. Finally, the results of a simulation study will be presented.
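    The three phases described in the abstract can be sketched in a strongly simplified form. The code below is an illustration under stated assumptions, not the author's RTA software: it handles one binary treatment and one continuous covariate, allows a single free split after the forced treatment split, and compares in-sample R-squared instead of the cross-validated variance accounted for. All function names and the simulated data are hypothetical.

```python
import numpy as np

def ols_fit(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def r_squared(X, y):
    resid = y - X @ ols_fit(X, y)
    return 1.0 - resid.var() / y.var()

def best_threshold(x, r):
    """Threshold c on x minimising the summed within-node SSE of residuals r."""
    best_c, best_sse = None, np.inf
    for c in np.unique(x)[:-1]:                 # keep both child nodes non-empty
        left, right = r[x <= c], r[x > c]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_c, best_sse = c, sse
    return best_c, best_sse

def regression_trunk(t, x, y):
    # Phase 1: main effects model by multiple regression.
    X_main = np.column_stack([np.ones_like(y), t, x])
    resid = y - X_main @ ols_fit(X_main, y)
    # Phase 2: first split forced on treatment, then one free split on x
    # in the treatment arm where it reduces the residual SSE most.
    candidates = []
    for arm in (0, 1):
        mask = t == arm
        c, sse = best_threshold(x[mask], resid[mask])
        node_sse = ((resid[mask] - resid[mask].mean()) ** 2).sum()
        candidates.append((node_sse - sse, arm, c))
    _, arm, c = max(candidates)
    # Phase 3: turn the trunk into a contrast variable and add it to the model.
    contrast = ((t == arm) & (x > c)).astype(float)
    X_trunk = np.column_stack([X_main, contrast])
    return arm, c, r_squared(X_main, y), r_squared(X_trunk, y)

# Hypothetical data with a threshold interaction at x > 0.5 in treatment arm 1.
rng = np.random.default_rng(0)
n = 400
t = rng.integers(0, 2, n)
x = rng.uniform(0.0, 1.0, n)
y = 1.0 + 0.5 * t + x + 2.0 * ((t == 1) & (x > 0.5)) + rng.normal(0.0, 0.3, n)
arm, c, r2_main, r2_trunk = regression_trunk(t, x, y)
print(arm, round(float(c), 2), round(float(r2_main), 3), round(float(r2_trunk), 3))
```

    The contrast variable encodes a threshold interaction: it is 1 only for subjects in one treatment arm whose covariate value exceeds the split point, which is different from the usual cross-product term of treatment and covariate.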

    Elise Dusseldorp studied psychology (specialisation “Methods and Techniques”) at Leiden University (UL). She obtained her PhD in 2001 with the thesis “Discovering Treatment Covariate Interaction: An Integration of Regression Trees and Multiple Regression”, supervised by prof. dr. J. J. Meulman and prof. dr. S. Maes. In her thesis she developed a method for detecting differential effects of treatments, the “Regression Trunk Approach”. In July 2002 she received a VENI grant from NWO. The topic of her VENI research is “Modelling interaction effects as small trees in regression and classification”. This research will start in November at the Data Theory Group of the Department of Education (Pedagogische Wetenschappen) at the UL. She currently works as a postdoctoral researcher in the same group. Her research interests are classification and regression trees, regression analysis, interaction effects, optimal scaling, and data mining. She has a special interest in medical, psychological, and educational applications, as shown by her publications in international journals in medicine and psychology.

    France Portrait (Free University, Amsterdam)

    The Grade of Membership analysis: an application to the Dutch elderly population

    With the aging of society, issues concerning the reform of the Dutch health care system rank high on the political agenda. Sensible reforms of the health care system for the elderly require a thorough understanding of the functional status of the old and of the dynamics in the functional status preceding death. This paper is about these issues. The functional status of the elderly is intrinsically a multidimensional concept that may vary across the life cycle, and a rich set of indicators is needed to capture the multidimensional aspect of health to its full extent. This multidimensionality is, however, a weakness, as it will in general be difficult to handle all indicators in any economic analysis. In the paper we focus on methods that compress these multidimensional measures into a limited number of indices. The Grade of Membership approach introduced by Manton and Woodbury in 1982 is specifically designed to characterize the complex concept of health. The method simultaneously identifies all dimensions of the concept of interest and the degrees to which an individual belongs to each of these types. We apply the method to a set of 21 indicators from a rich database of the Longitudinal Aging Study Amsterdam. The individual degrees of involvement in the different health dimensions obtained from this method are used in subsequent analyses of the relation between health and mortality.

    France Portrait works as a postdoc at the Vrije Universiteit in Amsterdam. She works part-time at the Faculty of Economics of the VU and part-time with the Longitudinal Aging Study Amsterdam research group. She obtained her PhD in 2000 with the thesis "Long-Term Care Service for the Dutch Elderly - An investigation into the Process of Utilization". Her supervisors were Prof. Dr. M. Lindeboom, Prof. Dr. D. Deeg, and Prof. Dr. A.H.Q.M. Merkies. She currently works mainly on two projects aimed at determining (i) the effects of widowhood on health and mortality and (ii) the course of health for successive cohorts.

    Merel van Dijk* (Department of Public Health, Erasmus MC)

    From regression analysis to prognostic classification: a case study of the International Germ Cell Consensus Classification for nonseminomatous germ cell cancer

    The International Germ Cell Consensus (IGCC) classification was developed to distinguish high and low risk patients with nonseminomatous germ cell tumors, and to facilitate collaborative trials. We reconsidered some steps in the construction of this simple prognostic classification while using the same risk factors. The IGCC classification is based on the risk factors location of primary tumor, presence of non-pulmonary visceral metastases, and elevation of the tumor markers alpha-fetoprotein (AFP), human chorionic gonadotrophin (HCG) and lactic dehydrogenase (LDH). The tumor markers were recoded into one risk factor and simple weights were allocated to the risk factors, after which the risk factors were combined into 3 prognosis groups (good, intermediate and poor). Performance might be improved by using weights that are based on regression coefficients, giving a better indication of the predictive strength of the risk factors. Alternative classifications were defined with 3 risk factors (tumor markers combined) and 5 risk factors (tumor markers separately), using regression-based weights. Although the weights differ from the simple weights of the IGCC classification, the alternative classifications are not much better in distinguishing high risk from low risk patients. The survival rates for good, intermediate and poor prognosis for all three classifications are 92%, 80% and 50%, and the discriminative ability was 0.74. Within the poor prognosis group, however, we were able to distinguish two groups of patients with significantly differing survival rates (65% vs. 41%). We conclude that, given the prognostic information used, the IGCC classification performs well. Splitting the poor prognosis group into two groups leads to a more accurate classification. Further possibilities for the improvement of the discriminative ability, like adding more risk factors and imputing missing values, will be discussed.

    * Joint work with E.W. Steyerberg and J.D.F. Habbema (Department of Public Health, Erasmus MC, Rotterdam, The Netherlands), and S.P. Stenning (Medical Research Council, Clinical Trials Unit, London, UK)

    Merel van Dijk graduated in Psychology at the University of Amsterdam in 2001. Currently she is working as a PhD-student at the Department of Public Health, at the Erasmus MC in Rotterdam on the project ‘Defining poor and good prognosis groups in oncology’ funded by NWO. The aim of this project is to gain more insight into what optimal methods are for constructing classifications.

    Lawrence Hubert (University of Illinois, Champaign)

    The Representation of Proximity Matrices by Tree Structures: A Tree Structure Toolbox (TST) for MATLAB

    We present and illustrate the capabilities of a MATLAB Toolbox for fitting various classificatory tree structures to both symmetric (one-mode) and rectangular (two-mode) proximity matrices. The emphasis is on identifying ultrametrics and additive trees that are well-fitting in the L2 norm by heuristic extensions of an iterative projection (least-squares) strategy. The (additive) fitting of multiple tree structures is also addressed.

    Lawrence Hubert is currently the Lyle H. Lanier Professor of Psychology (and Professor of Statistics and Educational Psychology) at the University of Illinois, Champaign. His research interests center on data analysis methods in psychology and the behavioral sciences generally, with particular emphasis on representation techniques and strategies of combinatorial data analysis, including exploratory optimization approaches and confirmatory nonparametric methods.


    VOC Spring Meeting 2002

    DATA MINING

    Hoog Brabant, Utrecht

    Friday 26 April 2002

    This spring the Belastingdienst (the Dutch Tax and Customs Administration) hosts the VOC data mining meeting on Friday 26 April 2002. The programme covers data mining techniques, applications, and the latest developments in the field. The Belastingdienst shows how data mining techniques can improve the levying and collection of taxes. Another talk asks whether data mining still leaves room for the statistician. A further speaker highlights both the algorithms and IT-related solutions. In addition, data mining is used to extract more information from surveys. There is also room for a somewhat more technical talk on association rules. The talks are highly diverse: reason enough to attend this spring meeting.

    • 10.00 - 10.30 Welcome
    • 10.30 - 11.10 Edith Nijenhuis & Frans J E Vermeulen (Belastingdienst, Utrecht)
    • 11.10 - 11.50 Irma Volkers (Belastingdienst, Utrecht)
    • 11.50 - 12.30 Marten den Uyl (Sentient Machine Research, Amsterdam)
    • 12.30 - 14.00 Lunch
    • 14.00 - 14.40 Machiel Westerdijk (CapGemini, Utrecht)
    • 14.40 - 15.20 Geert Wets (DAM LUC, Diepenbeek)
    • 15.20 - 15.40 Coffee and tea
    • 15.40 - 16.20 Joost Kok (Universiteit Leiden)
    • 16.20 - 16.45 Members' meeting
    • 16.45 - later Drinks

    Location: Hoog Brabant, Hoog Catharijne, Utrecht. Route information: for train travellers or for others.

    Abstracts and CVs

    Edith M Nijenhuis, Frans J E Vermeulen Re Ra (Belastingdienst, Utrecht)

    One chamois is not like another: profiles in the cleaning segment (original Dutch title: De ene zeem is de andere niet: profielen in het schoonmaak segment)

    The Belastingdienst uses the so-called Risk Management Model (Risico Beheersings Model). This model takes a risk-oriented approach, in which risks are defined as the probability of non-compliance with tax obligations. General and segment-specific risks are distinguished. A segment consists of several related industries. For each segment a segment description and a segment treatment plan have been drawn up, describing the segment and the approach to its segment-specific risks. Such a segment treatment plan is a nominal description of the most important risks in a segment. To gain more quantitative insight into segments and to track down new segment-specific risks, a pilot was carried out for the 'cleaning' segment. All kinds of fiscal and non-fiscal data were used to obtain profiles of subpopulations of the segment. The profiles were obtained with data mining techniques, of which principal components analysis was used most.

    Edith Nijenhuis has worked since 1 January 2000 as a researcher/methodologist at the Centrum voor Proces- en Productontwikkeling (Centre for Process and Product Development) of the Belastingdienst. She joined from the FIOD/Centrale Vestiging Informatie, where she had worked since December 1998. Edith studied personality psychology and methods and techniques of psychological research.

    Frans Vermeulen has worked since 1 January 2000 as a researcher/EDP auditor at the Centrum voor Proces- en Productontwikkeling of the Belastingdienst. He joined from the Belastingdienst Informatie Centrum, where he had worked since 1997. Before that he worked as an accountant and EDP auditor at the Douanedistrict Roermond (Roermond customs district).

    Irma Volkers (Belastingdienst, Utrecht)

    Insolvency research with data mining techniques at the Belastingdienst

    Internal research has shown that the Belastingdienst is unable to collect a substantial amount of tax from companies. An important cause is that impending insolvency is not detected in time, partly because the data available to the Belastingdienst are not used optimally.

    This presentation describes the study that was set up to provide a possible solution to this problem. The aim of the study was to develop an instrument (model) for detecting insolvency in time. Insolvency is defined here as failure to meet fiscal payment obligations. The expectation is that losses can be prevented or limited by detecting risks sooner. Because the data files were very large, it was decided to apply modern data analysis techniques such as data mining. The biggest problem in the study turned out to be making the data suitable for analysis; this took up most of the research time.

    The final results seem to indicate that statistical models for predicting insolvency can add value to the collection process compared with the current working method of the Belastingdienst.

    Irma Volkers graduated in psychology (cognitive psychology, artificial intelligence, machine learning) from the UvA and worked as a researcher at Cartesian Products (self-learning systems) and Cap Gemini (knowledge systems). She now works at the Centrum voor Proces- en Productontwikkeling of the Belastingdienst in Utrecht. Her topics are artificial intelligence, audit automation, data mining, and internet technology.

    Marten den Uyl (Sentient Machine Research, Amsterdam)

    Data Mining for End Users

    One possible definition of 'data mining' is the highly automated execution of large numbers of statistical analyses on a data collection, with the aim of tracking down unexpected and interesting things (associations, relationships, trends) in the data. From this perspective it is understandable that the 'holy grail' of data mining software development is 'automating away the statistician', i.e. enabling end users to analyse their data independently, without having statistical expertise. As is the way with holy grails, this is not an easy goal to achieve. On the basis of a number of practical projects, the possibilities and limitations of data mining for end users will be demonstrated. The talk will also address the requirements that 'usability' and 'robustness' place on the underlying algorithms, and why some types of algorithms (fuzzy matching, nearest neighbors) seem better able to meet them than others.

    Marten den Uyl graduated in cognitive psychology from the University of Amsterdam in 1978. He then worked as a researcher at the University of Amsterdam and at Stanford University (California) in various areas of psychology: text comprehension and control processes in reading, judgement theory, ethnic attitudes, and connectionist (neural) modelling. In 1987 Den Uyl became a knowledge technology consultant at Bolesian, one of the first AI companies in the Netherlands. In 1990 he founded Sentient Machine Research. Sentient specialises in applying adaptive and associative techniques to make information from large data streams and data files more accessible, insightful, and usable. Sentient supplies the DataDetective visual data mining software and provides data analysis and custom software development services to a wide range of clients.

    Machiel Westerdijk (CapGemini, Utrecht)

    Data Mining for Business Intelligence

    Companies and organizations today collect more and more information, for example via the web, call centers, and transaction and registration systems. Besides its operational function, this information increasingly plays a role at a higher level, namely providing insight into the state of the organization and its interaction with its environment. Specifically for this task, many organizations build database systems dedicated to management and marketing analyses. The analysis methods used are usually not advanced. Often only reports are produced that present straightforward counts and averages, answering questions such as 'what was the revenue in region A in the last quarter?'. Important knowledge, however, often lies in trends and patterns that depend in a complicated way on many kinds of data. Data mining technology is used to uncover this kind of knowledge. In this talk I will discuss four areas in which there is a great need for data mining solutions: (1) financing in health care, (2) fraud in social security, (3) credit risk in banking, and (4) retention and customer segmentation in marketing. Within these areas a data mining project can take different forms. It can range from a specialist research project, with the emphasis on algorithmic sophistication, to setting up an operational IT system, with the emphasis on technological sophistication. Both situations give a good picture of the kind of work we deal with in the data mining world. For many practical problems that organizations struggle with, the currently available data mining methods already offer a suitable solution. For future developments it is interesting to look at the questions to which no adequate answer can yet be given. In the last part of the talk I will take a closer look at the following questions: (1) how do we combine data of different kinds, such as text data and numerical data, and (2) how do we combine expert knowledge and knowledge from data? For the answers to these questions, much is expected in particular of Bayesian statistics and graphical models.

    Machiel Westerdijk studied aeronautical engineering and physics. After his studies he did doctoral research in data mining and machine learning at the University of Nijmegen. The research focused on developing new techniques for classification and pattern discovery. The title of the thesis is 'Hidden Variable Models for Knowledge Discovery'. After his doctorate he worked for a year as a post-doc at the Stichting Neurale Netwerken (SNN) on a European project on recognizing emotions from speech and video data. Since June 2001 he has worked as a consultant at Cap Gemini Ernst & Young in the 'Business Intelligence & Datawarehousing' unit. There he is involved in various projects, in health care, banking, and marketing, in which data mining plays an important role.

    Geert Wets (DAM LUC, Diepenbeek, Belgium)

    Detect latent dissatisfaction in surveys in the financial sector

    One of the main problems in interpreting the results of customer satisfaction surveys is that some customers still churn even though they report being satisfied overall. This talk explains how data mining techniques can be used to detect these latently dissatisfied customers and to give a profile of them. To illustrate the approach, a case study in the financial sector will be described.

    Geert Wets is professor of business informatics at Limburg University, faculty of applied economics. He holds a degree as commercial engineer in business informatics (Catholic University of Leuven) and a PhD from Eindhoven University of Technology. He has published in several leading journals including 'Fuzzy Sets and Systems', 'Intelligent Systems' and 'Information Systems'. Furthermore, he has presented papers at many international conferences (e.g., KDD, AAAI and ECAI). His current research covers data mining, analytical CRM, and fuzzy set theory. He has lectured on the subject of data mining and data warehousing on several occasions.

    Joost N. Kok (Leiden Institute of Advanced Computer Science)

    Pre-computing and Post-processing of Association Rules

    Data mining is the search for hidden information in large databases. This information is in practice often in the form of so-called association rules, i.e., a special type of if-then rules (for instance: if a customer buys a newspaper, he or she also buys chocolate). Association rules are frequently used in market basket analysis (dealing with 0-1 matrices), but are also applicable to other types of databases. There exist several algorithms that generate association rules, such as the Apriori algorithm. The number of these rules (or of the underlying frequent itemsets) is usually enormous, and the full set is therefore hardly useful. One way to decrease the number of rules is to apply different notions of interestingness, thereby sorting the rules or filtering out the best ones. In the presentation we introduce association rules, describe the Apriori algorithm, and show some methods that are active during the computation phase (pre-computing) and methods that operate once rules have been found (post-processing).
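
    The two basic quantities behind such rules can be sketched in a few lines of Python. The baskets below are made up, the helper names `support` and `confidence` are hypothetical, and confidence is only one of many possible notions of interestingness; the sketch shows how a single rule is scored and is not the Apriori algorithm or the pre-computing and post-processing methods of the talk.

```python
# Hypothetical market baskets (0-1 data written as sets); not from the talk.
baskets = [
    {"newspaper", "chocolate"},
    {"newspaper", "chocolate", "milk"},
    {"newspaper"},
    {"chocolate", "milk"},
    {"milk"},
]


def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def confidence(antecedent, consequent):
    """Share of baskets with `antecedent` that also contain `consequent`."""
    return support(antecedent | consequent) / support(antecedent)


# Score the rule "if newspaper then chocolate".
rule_support = support({"newspaper", "chocolate"})
rule_confidence = confidence({"newspaper"}, {"chocolate"})
print(rule_support, round(rule_confidence, 3))  # prints: 0.4 0.667
```

    Because support can only decrease when items are added to an itemset, a rule generator can skip every superset of an infrequent itemset; Apriori-style algorithms rely on that monotonicity.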

    Joost N. Kok studied mathematics at the University of Amsterdam. He then worked as a PhD student at the Vrije Universiteit and later at the Centrum voor Wiskunde en Informatica in Amsterdam. He received his PhD from the Vrije Universiteit in 1989 with the thesis "Semantic Models for Parallel Computation in Data Flow, Logic- and Object-Oriented Programming", with prof. dr J.W. de Bakker as supervisor. He then became assistant professor and later associate professor of Neurocomputing at Utrecht University. During this period he also worked for some time at the Abo Akademi University in Finland. Since 1995 he has been Professor of Fundamental Computer Science at Leiden University. His research interests are bioinformatics, coordination of software components, optimization, and data mining. In these areas he leads the ALP research group. He is director of education for Computer Science, chair of the Belgian/Dutch association for Artificial Intelligence, and editor of various journals and book series. For more information: www.liacs.nl/~joost


    VOC Fall Meeting 2001 (VOC-najaarsbijeenkomst 2001)

    Leiden University (UL)

    Friday 26 October 2001

    Mark this date in your diary now, because it promises to be a wonderful day.

    The host is the Department of Medical Statistics of Leiden University. The keynote speaker is Jerome Friedman from Stanford, known among other things for the groundbreaking book Classification and Regression Trees. Brian Marx, Pieter Jan Stappers, Josephine Woltman Elpers, Bart-Jan van Os, and Hilde Tobi will also be speaking.

    Because of the LUNCH, please register in advance with Paul Eilers.

    On October 26, the Dutch Society for Ordination and Classification (Vereniging voor Ordinatie en Classificatie, VOC) will have its Fall meeting in Leiden. We offer an interesting program, a blend of theory and applications. The venue is a lecture room in the Physiology Building of the Leiden University Medical Centre. Information: Paul Eilers (p.eilers@lumc.nl, 071 527814).

    Program

    • 11:00 - 11:45 Brian Marx (Louisiana State University)
    • 11:45 - 12:20 Pieter Jan Stappers (Delft University of Technology)
    • 12:20 - 14:00 Lunch in Restaurant de Posthof
    • 14:00 - 14:35 Josephine Woltman Elpers (Groningen University)
    • 14:35 - 15:10 Hilde Tobi (Groningen University)
    • 15:10 - 15:40 Tea
    • 15:40 - 16:15 Bart-Jan van Os (Leiden University)
    • 16:15 - 17:00 Jerome H. Friedman* (Stanford, USA)
    • 17:00 - ??.00 Drinks

    Abstracts and CVs

    Brian D. Marx* (Louisiana State University)

    Multivariate Calibration Stability: A Comparison of Methods

    Multivariate calibration (MVC) essentially leads to a regression model, providing a high-dimensional vector of estimated coefficients. These coefficients weight the spectrum, or more generally the signal, and provide the estimated response, e.g. the concentration of a chemical component of interest. To develop a model, a set of calibrated concentrations (the response) and the accompanying spectra, measured at many wavelengths (the regressors), are collected. As there are generally many more regressors than training observations, classical linear regression breaks down and special methods have to be developed. A problem of great practical interest is reliable prediction and robustness of the MVC training model. The calibration data are generally collected with one instrument, under stable conditions. When the model is going to be used on-line, another instrument may be used and/or the conditions may change, which may lead to shifting, warping, and scaling of the spectral regressors. This is the problem we term (multivariate) calibration stability (MCS). We investigate the transfer prediction performance of three modeling strategies: partial least squares (PLS), principal component regression (PCR), and P-spline signal regression (PSR). Our results focus on a designed mixture experiment under temperature transfer conditions, where we examine when smoothness of the PSR coefficients, compared to the relatively erratic PLS/PCR coefficients, can lead to differing stability in prediction performance.
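
    A minimal Python sketch of why classical regression breaks down when there are more wavelengths than training samples, and of how a smoothness penalty on the coefficients restores a unique solution. The simulated data, the penalty weight `lam`, and the second-order difference penalty applied directly to the coefficients (without a B-spline basis) are simplifying assumptions; this is not the PLS, PCR, or PSR code compared in the talk, and it says nothing about calibration stability under temperature transfer.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 50                      # 10 training samples, 50 wavelengths (made up)
X = rng.normal(size=(n, p))        # stand-in for measured spectra
y = rng.normal(size=n)             # stand-in for calibrated concentrations

# With p > n the normal equations X'X b = X'y have no unique solution:
# X'X is p x p, but its rank is at most n.
rank = int(np.linalg.matrix_rank(X))

# Minimum-norm least-squares fit: reproduces y, but the coefficients are erratic.
b_ls = np.linalg.lstsq(X, y, rcond=None)[0]

# Second-order difference penalty on neighbouring coefficients (assumption:
# a simplified stand-in for a P-spline penalty, without a B-spline basis).
D = np.diff(np.eye(p), n=2, axis=0)
lam = 10.0
b_pen = np.linalg.solve(X.T @ X + lam * D.T @ D, X.T @ y)


def roughness(b):
    """Sum of squared second differences of the coefficient vector."""
    return float(np.sum((D @ b) ** 2))


print(rank, roughness(b_pen) <= roughness(b_ls))  # prints: 10 True
```

    The penalized coefficients can never be rougher than the unpenalized least-squares ones, because the unpenalized fit already attains the smallest possible residual and the penalty then only trades fit for smoothness.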

    *Joint work with Paul Eilers (Leiden University)

    Brian Marx is Professor in the Department of Experimental Statistics at Louisiana State University. He received his Ph.D. in Statistics from Virginia Tech. He is currently on a 6-month sabbatical leave at Leiden University, The Netherlands. His central research interests include: smoothing with P-splines, signal regression, generalized linear and additive modeling, and biased estimation for ill-conditioned data. He can be reached through bmarx@lsu.edu

    Pieter Jan Stappers (Delft University of Technology)

    MDS-Interactive: An interactive visual tool for reflective exploration of loosely structured data.

    The design phase of product development, and new areas of research in general, often start with informal exploration of loosely structured collections of data. Typically, computer-supported techniques only become helpful after the ideas have solidified and their formulation has been consolidated in a precise form. In the ID-StudioLab, we are developing tools for designers in the early, explorative phases of a design project. We put an emphasis on direct, visual techniques with a minimal demand on verbal language. In part, this is because designers are often strong in visual forms of reasoning, whereas current computer tools often require precise verbal formulations. Moreover, it is relevant because the members of today's multidisciplinary design teams often lack a common disciplinary language. In this talk I will present MDS-Interactive, a visual interface technique for browsing collections of multidimensional data, such as product catalogues.

    Pieter Jan Stappers is associate professor of design techniques at Delft University of Technology. He obtained his master's degree in experimental physics at the KU Nijmegen in 1984 and his Ph.D. at TU Delft in 1992, with a study of perceptual effects using head-mounted Virtual Reality systems. Current interests lie in the support of creativity and idea formation with (computer-aided) tools and techniques, and the design of interactive visual interfaces in general. He can be reached through http://www.io.tudelft.nl/id-studiolab/.

    Josephine L.C.M. Woltman Elpers (Groningen University)

    The Influence of Moment-to-Moment Pleasantness and Informativeness on Zapping TV Commercials: A Functional Data and Survival Analysis Approach.

    Although it is a topic of considerable importance in the advertising management literature, zapping behavior, and in particular its relationship with ad contents, has received limited attention in marketing science. Our main contribution is to show how changes in two major dimensions of TV ad contents (pleasantness and informativeness) enhance and/or depress the probability of zapping when TV commercials are aired. It is argued that the zapping probability depends not only on the levels of pleasantness and informativeness, but also on their velocities and on the interaction between pleasantness and informativeness. We use multiple samples of judges who assess 18 commercials on intended momentary pleasantness and informativeness. We introduce a non-parametric functional data approach to obtain representative measures of the pleasantness and informativeness curves for each commercial. Local polynomial regression is applied to obtain first-order derivatives that measure the velocity of change of these measures over time. We perform principal component analysis to obtain an average pleasantness and informativeness curve for each ad that maximizes variance across individual momentary responses. Zapping data are collected in an experimental setting with 71 subjects. A random-effect discrete-time hazard model that allows for a general specification of the baseline hazard and for differences across ads and subjects is estimated using simulated maximum likelihood. Ad uniqueness and ad familiarity are controlled for. The model reveals the effects that both the level and the velocity of moment-to-moment pleasantness and informativeness have on zapping probabilities. In support of our hypotheses, we find that higher pleasantness levels of commercials decrease the zapping probability, whereas higher informativeness levels increase it. The initial evidence derived from the significant positive interaction terms in our model of zapping suggests that pleasantness and informativeness are also incompatible at higher levels and positive velocities and increase the zapping probability. Implications for the design and pre-post testing of television commercials are also offered.

    Josephine Woltman Elpers (1974) studied Business Economics at the University of Groningen. Her studies focused on marketing research, which led to a Master's thesis on Multidimensional Scaling and Segmentation Methods in Micronutrient Research in 1998. Josephine then became a part-time lecturer and part-time PhD student at the Department of Marketing and Market Research of the University of Groningen. The main courses taught are Methods of Market Research and Retail Management. In 2001 she became a full-time PhD student, working on the Measurement and Analysis of Attention for TV Commercials under the supervision of Professor Dr. M. Wedel of the University of Groningen and Professor Dr. R.G.M. Pieters of the University of Tilburg. In this project the attention to TV commercials is investigated through the fragment-to-fragment analysis of commercial contents, moment-to-moment analysis of emotional responses to TV commercials, and analysis of eye-movement recordings, as well as analysis of channel-switching behavior. The objective is to build a theoretical and methodological framework to increase the understanding of zapping behavior during TV commercials. Besides her PhD work, she is interested in methods of quantitative and qualitative market research, hazard modeling, logit and mixture modeling, and multidimensional scaling.


    Hilde Tobi (Groningen University)

    Measuring patient compliance using pharmacy data

    In the Netherlands, every pharmacist keeps a record of all prescribed medication in one of the available computerized pharmacy systems. The focus of this presentation is on detecting non-compliance with prescribed regimens of drug administration, using data from the pharmacy system. Once non-compliance is suspected, the pharmacist may take action, for example by contacting the patient or the prescriber. Summing the quantities delivered over a certain period of time and dividing this by the quantity that should have been used (based on dosing schedule and time) yields the refill ratio. A ratio larger than 1 would mean that someone gets more medication from the pharmacy than he or she is supposed to according to the prescription. A ratio smaller than 1 indicates that someone skips his or her medication. In pharmacy practice research the cut-points 0.9 and 1.1 are often used to classify someone as non-compliant. These cut-points have grown historically, or so it seems. No information on the shape of the distribution of the ratio was available, neither for the general population nor for the users of a particular therapeutic group. Using the InterAction database, which contained data from 12 pharmacies on the medication use of approximately 100,000 patients, the refill ratio was investigated. The shape and location of the distribution were compared for two drugs: statins and sulfonylurea derivatives. In addition, the distribution of the ratio was compared for men and women and for different age categories. It was also investigated whether calculating the ratio over half a year instead of one year makes much difference.

    Hilde Tobi earned her MSc degree in educational technology in 1989 and became a research assistant in psychometrics (University of Twente). A Fulbright scholarship enabled her to study biostatistics and epidemiology for one year at the School of Public Health of the University of South Carolina. In 1993, Hilde took a position as consultant biostatistician at the Department of Clinical Epidemiology and Biostatistics at the Vrije Universiteit in Amsterdam. She has been the 'house statistician' of several hospital groups. Her 1999 dissertation, entitled "Some issues in applied statistics in clinical restorative dental research", was the result of work she had done for pedodontology. In February 2000, Hilde joined Social Pharmacy and Pharmacoepidemiology at the University of Groningen as assistant professor of pharmacoepidemiology and biostatistics. Her main research interests are children and medicines and the methodology of pharmacoepidemiological research. She teaches research methodology and participates in several (post)graduate courses.
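
    The refill ratio and the conventional cut-points described above can be written down as a small Python sketch. The dispensing record, the 180-day window, and the labels `under-use`, `over-use`, and `compliant` are made up for illustration and do not come from the InterAction database or from the talk.

```python
# Hypothetical dispensing record for one patient over a 180-day window.
dispensed_tablets = [30, 30, 30, 30, 30]   # quantities delivered by the pharmacy
prescribed_per_day = 1                      # dosing schedule
window_days = 180                           # period over which the ratio is taken

LOWER, UPPER = 0.9, 1.1                     # cut-points quoted in the abstract


def refill_ratio(delivered, per_day, days):
    """Quantity delivered divided by the quantity that should have been used."""
    return sum(delivered) / (per_day * days)


def classify(ratio):
    """Label a ratio using the 0.9 / 1.1 cut-points (labels are hypothetical)."""
    if ratio < LOWER:
        return "under-use"
    if ratio > UPPER:
        return "over-use"
    return "compliant"


ratio = refill_ratio(dispensed_tablets, prescribed_per_day, window_days)
print(round(ratio, 3), classify(ratio))  # prints: 0.833 under-use
```

    Recomputing the ratio with a different `window_days` (and the deliveries that fall inside it) is the kind of half-year versus full-year comparison the abstract refers to.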

    Bart Jan van Os (Leiden University)

    Globally Optimal Classification and Regression Trees

    Since their introduction, classification and regression trees have acquired popularity in many scientific fields. Such trees are usually constructed by growing the tree from the top, while optimizing each new split locally given previously acquired splits; the resulting trees may be improved upon by pruning or tree-averaging to pursue global objectives, but global optimality is not guaranteed. As a result, these trees may not be the best trees possible with respect to the misclassification rate or the residual sum of squared error in the terminal nodes. Here, the optimization of a classification tree with discrete predictors will be pursued through a global objective function defined on the classes formed by the terminal nodes, with size constraints on the tree. Although these size restrictions have to be applied a priori to solve the tree problem, they do not necessarily restrict the solution space of the final solution: the idea is to solve the problem for varying sizes, and to use cross-validation procedures to choose the best size restriction. The size-constrained tree problems can be phrased as particular constrained partitioning problems defined on the measurement space, related to globally optimal divisive hierarchical clustering. An exact exponential-time Dynamic Programming algorithm for solving this problem will be given, related to an algorithm for globally optimal divisive hierarchical clustering. An example of a regression tree problem with eleven predictors for a sample of size 102 will be discussed, which concerns the prediction of anxiety decrease for panic disorder patients who received either cognitive therapy or a drug therapy.

    Bart Jan van Os is a researcher at the Data Theory Group, Leiden University. He obtained his PhD in 2001 in Leiden for the thesis "Dynamic Programming for Partitioning in Multivariate Data Analysis", which won the 2001 Psychometric Society Dissertation Award. His primary research interests are in combinatorial data analysis and combinatorial optimization.
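
    The gap between a greedily grown tree and a globally optimal one can be seen on a toy problem. In the Python sketch below, the eight rows, the restriction to depth-2 trees on binary predictors, and the brute-force search are all assumptions made for illustration; the exhaustive search stands in for, and is not, the dynamic programming algorithm of the talk.

```python
from itertools import product

# Hypothetical data: (x1, x2, x3, y), where y is x1 XOR x2 and x3 is a noisy copy of y.
rows = [
    (0, 0, 0, 0), (0, 0, 0, 0), (0, 1, 1, 1), (0, 1, 1, 1),
    (1, 0, 1, 1), (1, 0, 0, 1), (1, 1, 0, 0), (1, 1, 1, 0),
]
PREDICTORS = (0, 1, 2)  # zero-based column indices of x1, x2, x3


def leaf_error(subset):
    """Misclassifications when a leaf predicts its majority class."""
    ones = sum(r[3] for r in subset)
    return min(ones, len(subset) - ones)


def split_error(subset, j):
    """Total leaf error after splitting `subset` on binary predictor j."""
    return sum(leaf_error([r for r in subset if r[j] == v]) for v in (0, 1))


def depth2_error(root, child0, child1):
    """Error of a depth-2 tree with the given root and child predictors."""
    left = [r for r in rows if r[root] == 0]
    right = [r for r in rows if r[root] == 1]
    return split_error(left, child0) + split_error(right, child1)


# Greedy: pick the locally best root, then the best split within each branch.
root = min(PREDICTORS, key=lambda j: split_error(rows, j))
greedy_error = sum(
    min(split_error([r for r in rows if r[root] == v], j) for j in PREDICTORS)
    for v in (0, 1)
)

# Global: exhaustive search over all depth-2 trees (feasible only for toy sizes).
global_error = min(depth2_error(*tree) for tree in product(PREDICTORS, repeat=3))

print(root, greedy_error, global_error)  # prints: 2 2 0
```

    On these data the greedy tree starts with the third predictor, which looks best on its own, and ends with two misclassifications, while the exhaustive search finds a tree on the first two predictors with none.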

    Jerome H. Friedman* (Stanford, USA)

    Weighted Harmonic Distance Clustering

    A dissimilarity measure for value-attribute data is proposed for use in cluster analysis. It assigns small dissimilarities to observation pairs that have close values on any subset of the attribute variables regardless of their values on the complement set of variables. Using this measure in conjunction with dissimilarity based clustering algorithms encourages the detection of subgroups of observations that preferentially cluster on subsets of the variables. The relevant variable subsets for each individual cluster can be different and partially (or completely) overlap with those of other clusters. Enhancements for increasing sensitivity for detecting especially low cardinality groups clustering on a small subset of variables are discussed. Applications in several different domains, including gene expression arrays, are presented.
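
    Why a harmonic-type combination behaves this way can be checked with simple arithmetic. The Python sketch below uses an unweighted harmonic mean of made-up per-attribute distances; that choice is an assumption suggested by the title of the talk, and it is not the dissimilarity measure or the clustering algorithm actually proposed.

```python
# Per-attribute distances between two hypothetical observations:
# very close on the first attribute, far apart on the other two.
distances = [0.01, 5.0, 5.0]

# The arithmetic mean is dominated by the large distances.
arithmetic = sum(distances) / len(distances)

# The harmonic mean is dominated by the smallest distance.
harmonic = len(distances) / sum(1.0 / d for d in distances)

print(round(arithmetic, 3), round(harmonic, 3))  # prints: 3.337 0.03
```

    A pair that is close on even one attribute therefore receives a small combined dissimilarity, whatever its distances on the remaining attributes.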

    *Joint work with Jacqueline Meulman.

    Jerome Friedman is Professor of Statistics at Stanford University, and Leader of the Computation Research Group at the Stanford Linear Accelerator Center. His primary research interests are in Machine Learning and Data Mining. He has invented several of the statistical techniques commonly used in those fields.

    VOC Spring Meeting 2001 (VOC-voorjaarsbijeenkomst 2001)

    Robustness

    University of Antwerp (UIA)

    Building D, room D9.10 (morning) and building B, room B0.32 (afternoon)

    Friday 20 April 2001

    In collaboration with the University of Antwerp (UIA), the VOC is organizing a meeting on 20 April on the theme "Robustness". The talks will pay attention to new robust methods and their properties. The emphasis will be on the applicability of these methods, particularly in data sets with many outliers, with applications in, among other areas, finance, extreme value statistics, chemistry, time series, and spatial statistics.

    Program

    • 10.00-10.30 Welcome
    • 10.30-11.10 J. Beirlant (Katholieke Universiteit Leuven)
    • 11.10-11.50 H. P. Lopuhaa (Delft University of Technology)
    • 11.50-12.30 P. Rousseeuw, S. Van Aelst and K. Van Driessen (University of Antwerp)
    • 12.30-14.00 Lunch
    • 14.00-14.40 A. Lucas (Vrije Universiteit, Amsterdam)
    • 14.40-15.20 W.J.J. Rey (Philips Research Laboratories Eindhoven)
    • 15.20-15.40 Coffee/tea
    • 15.40-16.20 C. Becker (University of Dortmund)
    • 16.20-16.45 Members meeting VOC
    • 16.45-later Cocktail

    Organization: Peter Rousseeuw (Peter.Rousseeuw@ua.ac.be)

    Please register with Paul Eilers.

    Directions

    By train: Travellers arriving at Antwerpen Centraal take bus 17 to the Universiteitsplein stop or the stop after it. Travellers arriving at Berchem Station take bus 21, also to the Universiteitsplein stop or the stop after it. See also the UIA website. NOTE: The UFSIA ("Sint Ignatius") and the UIA campus are at different locations.

    By car: Travellers coming via Brussels follow the E19 towards Antwerp and, after exit 7 (Kontich), take the special exit reserved for traffic to UIA/UZA. Travellers coming via the Antwerp ring road follow the E19 towards Brussels and, immediately after the Craeybeckxtunnel, take the special UIA/UZA exit. From there, follow the signs to UIA, parking 2.

    Detailed information on how to reach the UIA campus can be found at http://www.ua.ac.be/algemeen/index.html. NOTE: Do not confuse the location of the UIA campus (on the outskirts of Antwerp) with the location of the UFSIA (in the centre of Antwerp). So do not go to UFSIA, but to the UIA campus.

    In the morning the meeting will be held in building D, basement level, room D9.10, and in the afternoon in building B, ground floor, room B0.32.

    Jan Beirlant, Katholieke Universiteit Leuven

    Extreme value statistics and robustness methods: an impossible marriage?

    In contrast to robust statistical methods, the first aim of extreme value methods is to model the most outlying observations in a data set. Indeed, in some instances these extreme observations are the most valuable to practitioners. Here we can for instance refer to actuarial applications, where the largest claims of course demand the actuary's full attention. In civil engineering one wants to safeguard against the most extreme winds, water levels, etc. Other important applications can be found in finance (extreme losses must be estimated by law), meteorology (do minimal temperatures increase?), and geology (earthquakes). The statistical methods that have been developed in this context over the last few decades are often not very robust. Recently, however, some new techniques have appeared to overcome some of these problems. Also, in multivariate extreme value analysis and in regression analysis with emphasis on extreme value modelling, methods are being borrowed from recent robustness literature. We will give some examples of these unexpected interactions.

    Jan Beirlant is Professor of Statistics at the University of Leuven. He obtained his PhD in 1984 from the same university, and has held visiting positions at the University of Washington and the University of Paris VI. His primary research interests are in extreme value methods and nonparametric density and regression estimation. He has cooperated with colleagues in a variety of applications of statistics including insurance, finance, civil engineering and biochemistry. He is a Fellow of the International Statistical Institute. jan.beirlant@wis.kuleuven.ac.be

    Hendrik P. Lopuhaa, Technische Universiteit Delft

    Robustly weighted estimators for multivariate location and covariance

    In this talk I will discuss a weighted sample mean and sample covariance, where the weights are determined by the Mahalanobis distances with respect to initial robust estimators. The focus will be on the asymptotic behaviour of the weighted estimators, for which an explicit asymptotic expansion is given. From this expansion it can be seen that reweighting does not improve the rate of convergence of the initial estimators. As an example we discuss the case where smooth S-estimators are used to determine the weights, in which case the weighted estimators are asymptotically normal. We will compare the efficiency and local robustness of the reweighted S-estimators with two other improvements of S-estimators: τ-estimators and constrained M-estimators.

    Rik Lopuhaa is Associate Professor of Statistics in the Department of Control, Risk, Optimization, Systems and Stochastics at the Faculty of Information Technology and Systems at the Delft University of Technology, The Netherlands. He received his doctoral degree in 1986 at the University of Amsterdam and his PhD in 1990 at the Delft University of Technology. He has been lecturing at Delft University since 1990, and was visiting assistant professor at the University of Washington, Seattle, in 1991. His primary research interest is distribution theory for non-parametric estimators in inverse problems, but he is also active in robust methods for multivariate data, extreme value theory, and comprehensive definitions of breakdown points for dependent and independent data. H.P.Lopuhaa@its.tudelft.nl
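
    The reweighting step can be sketched in Python as follows. The sample, the use of the coordinate-wise median and a diagonal MAD-based scatter as initial estimators, and the hard-rejection cutoff 9.21 (approximately the 99% point of a chi-square distribution with 2 degrees of freedom) are assumptions chosen for brevity; the talk uses S-estimators as initial estimators and studies the asymptotics, neither of which is reproduced here.

```python
import numpy as np

# Hypothetical 2-D sample: eight regular points and two gross outliers.
X = np.array([
    [0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [0.1, 0.2], [-0.2, -0.2],
    [0.0, -0.1], [0.3, 0.1], [-0.3, 0.1],
    [8.0, 8.0], [9.0, 7.5],
])

# Initial robust estimators (assumption: coordinate-wise median and a
# diagonal scatter built from the scaled median absolute deviation).
m0 = np.median(X, axis=0)
mad = 1.4826 * np.median(np.abs(X - m0), axis=0)
S0 = np.diag(mad ** 2)

# Squared Mahalanobis distances with respect to the initial estimators.
diff = X - m0
d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(S0), diff)

# Hard-rejection weights: 1 inside the cutoff, 0 outside (assumed cutoff).
w = (d2 <= 9.21).astype(float)

# Weighted sample mean and weighted sample covariance.
mean_w = (w[:, None] * X).sum(axis=0) / w.sum()
centered = X - mean_w
cov_w = (w[:, None] * centered).T @ centered / (w.sum() - 1)

print(int(w.sum()), X.mean(axis=0), mean_w)
```

    The two outliers receive weight zero, so the weighted mean stays near the bulk of the data while the ordinary sample mean is pulled towards the outliers.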

    P. Rousseeuw, S. Van Aelst and K. Van Driessen

    Robust Multivariate Regression

    We construct a robust method for multivariate regression, based on the Minimum Covariance Determinant (MCD) estimate of the joint location and scatter matrix of the explanatory and response variables. The resulting method has the appropriate equivariance properties, a bounded influence function, and the same breakdown value as the initial MCD estimator. To increase the efficiency we propose a reweighted estimator, which was selected from several possible reweighting schemes. Simulations show that the asymptotic properties of robustness and efficiency remain valid at finite samples. The method does not need much computation time, and is applied to chemical engineering data.
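
    How regression coefficients follow from a joint location and scatter estimate can be sketched in Python. The function name `regression_from_joint` and the data are hypothetical, and the classical mean and covariance are used as placeholders so that the result can be checked exactly; in the method of the abstract the MCD location and scatter would take their place. The formulas are the standard ones for regression derived from joint moments and are assumed, not quoted, here.

```python
import numpy as np


def regression_from_joint(location, scatter, p):
    """Regress the last q variables on the first p, using only a joint
    location vector and scatter matrix of (x, y)."""
    mu_x, mu_y = location[:p], location[p:]
    s_xx = scatter[:p, :p]
    s_xy = scatter[:p, p:]
    slope = np.linalg.solve(s_xx, s_xy)      # p x q coefficient matrix
    intercept = mu_y - slope.T @ mu_x        # length-q intercept
    return intercept, slope


# Hypothetical noise-free data: 2 explanatory and 2 response variables.
rng = np.random.default_rng(1)
x = rng.normal(size=(50, 2))
true_slope = np.array([[1.0, -2.0], [0.5, 3.0]])
true_intercept = np.array([4.0, -1.0])
y = true_intercept + x @ true_slope
z = np.hstack([x, y])

# Placeholder estimates: classical mean and covariance of the joint data.
# A robust method would substitute robust location and scatter here.
location = z.mean(axis=0)
scatter = np.cov(z, rowvar=False)

intercept, slope = regression_from_joint(location, scatter, p=2)
print(np.allclose(slope, true_slope), np.allclose(intercept, true_intercept))  # prints: True True
```

    Because the regression is a function of the joint location and scatter only, the robustness of those two estimates carries over directly to the fitted coefficients.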

    Peter Rousseeuw is Professor of Statistics at the University of Antwerp, Belgium. He gained his PhD in 1981 and has held academic posts at Technical University Delft and University of Fribourg, as well as visiting positions at various universities, including Berkeley. He is the author of 3 books and over 120 articles in statistics. His primary research interests are robust methods, cluster analysis and depth functions. His work has applications in a wide variety of fields such as Analytical Chemistry, Finance, Computer Vision and Medicine. Stefan Van Aelst is Research Assistant of the Fund for Scientific Research-Flanders and gained his PhD in 2000. Katrien Van Driessen is assistant at the Department of Applied Economics, University of Antwerp. Peter.Rousseeuw@ua.ac.be. See also http://win-www.uia.ac.be/u/statis.

    André Lucas, Vrije Universiteit, Amsterdam

    Comprehensive definitions of breakdown points for dependent and independent data

    We provide a new definition of breakdown in finite samples with an extension to asymptotic breakdown. Previous definitions center around defining a critical region for either the parameter or the objective function. If for a particular outlier constellation the critical region is entered, breakdown is said to occur. In contrast to the traditional approach, we leave the definition of the critical region implicit. Our definition encompasses all previous definitions of breakdown in both linear and non-linear regression settings. In some cases, it leads to a different notion of breakdown than other procedures available. An advantage is that our new definition also applies to models for dependent observations (time-series, spatial statistics) where current breakdown definitions typically fail. We illustrate our points using examples from linear and non-linear regression as well as time-series and spatial statistics.

    André Lucas is Associate Professor of Finance at the Vrije Universiteit in Amsterdam. He obtained his PhD in outlier robust time series analysis at the Econometrics department of Erasmus University Rotterdam. His research interests and publications include both theoretical work on non-stationary time-series analysis and more applied work in the area of financial economics. alucas@econ.vu.nl
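
    As background, the classical notion of breakdown for location estimators can be shown in a few lines of Python. The sample and the outlier values are made up; the sketch illustrates the traditional idea only and not the new definition proposed in the talk.

```python
import statistics

# Hypothetical clean sample; one observation is then replaced by an outlier.
clean = [9.8, 10.1, 10.0, 9.9, 10.2]

results = []
for outlier in (1e3, 1e6, 1e9):
    contaminated = clean[:-1] + [outlier]
    results.append(
        (statistics.mean(contaminated), statistics.median(contaminated))
    )

# The mean follows the outlier without bound; the median does not move.
for mean_value, median_value in results:
    print(f"mean={mean_value:.1f}  median={median_value:.2f}")
```

    A single replaced observation is enough to carry the mean arbitrarily far away, whereas the median of five points stays put until more than two of them are replaced.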

    William J.J. Rey, Philips Research Laboratories, Eindhoven

    Robust smooth Karhunen-Loeve expansions of experimental curves

    A chemical system is stressed and, then, it relaxes in a complicate manner where it comes back to an equilibrium. During the equilibrium recovery, relaxation curves are measured and we report on the principal component analysis of this (dirty) data set. The main issues are with respect to the outlyingness of each of the curves and to our desire of obtaining smooth principal components. Each of the relaxation curves is unique; during the experience, the chemical system evolves and the equilibrium states differ. Nevertheless, in spite of a very important noise, a curve pattern is clearly visible. Can we get this pattern out by decomposing these relaxations curves into orthogonal components? These orthogonal components are easily worked out in the least squares set-up; unfortunately, they depend heavily on the curve noise and are little smooth. The sensitivity to the noise can be reduced by robustification. How to make the principal component curves smooth is much more tricky.

    William Rey is Principal Scientist with Philips Research at Eindhoven, the Netherlands. At the beginning of his career as a statistician, he was confronted with the treatment of "dirty" data sets collected in cardiology; this chance circumstance led to his study of analysis methods that are outlier resistant. Since then, he has maintained a balance between the theoretical and the practical issues of robustness. He is the author of 3 books and some 50 papers. More recently, his work has consisted mainly of advising on statistical matters; he often compares this activity to the consultation of a (hopefully good) general practitioner, who is supposed to know enough about everything in industrial statistics (and to refer to the true specialist when needed). william.rey@philips.com or rey@natlab.research.philips.com

    Claudia Becker, University of Dortmund

    Effects of outliers on the analysis of high-dimensional data

    Outliers in datasets can affect statistical procedures in various ways. For high-dimensional data, the effects of outliers may be different compared to the lower-dimensional case. Moreover, the consequences of the occurrence of outliers are not even all known because the impact of spurious observations becomes less transparent with growing complexity of models and methods. Different methods such as outlier identification procedures, principal component analysis and sliced inverse regression are discussed with respect to the outlier problem and to the development of robustified versions. Developing such robust procedures becomes harder with increasing complexity of the data structure.

    Claudia Becker is Assistant Professor at the Department of Statistics, University of Dortmund, Germany, where she also obtained her PhD in 1996. Her main research interest is in robust statistical methods for multivariate data. As a member of the Collaborative Research Centre "Reduction of Complexity for Multivariate Data Structures" at the University of Dortmund, she is working in the field of the analysis of high-dimensional data, with a special focus on robust methods for dimension reduction. cbecker@statistik.uni-dortmund.de


    VOC Autumn Meeting 2000

    Nonparametrics and nonlinearity

    Tilburg, Building A, Room 186

    Friday 1 December 2000

    Programme

    • 10.00-10.30 Reception
    • 10.30-11.00 B. Melenberg (University of Tilburg) Overview of nonparametrics
    • 11.00-12.00 B. Silverman (University of Bristol) Empirical Bayes approaches to wavelet smoothing
    • 12.00-13.00 Lunch
    • 13.00-14.00 I. Gijbels (Catholic University of Leuven) Nonparametric tests for monotonicity of a regression mean
    • 14.00-14.45 M. Timmerman (University of Groningen) Simultaneous component models with smoothness constraints of multivariate time series of a number of subjects
    • 14.45-15.15 Coffee/tea
    • 15.15-16.00 A. van Soest (University of Tilburg) Nonparametric modeling of the anchoring effect in an unfolding bracket design
    • 16.00-16.45 B. Donkers (Erasmus University) A consumer-theory-consistent semiparametric estimator of Engel curves

    Please register in advance with H.P.A.M.vdnBorne@kub.nl


    Organisers: Harald van Heerde and Tammo Bijmolt (KUB)


    Abstracts

    Overview of nonparametrics
    Bertrand Melenberg, University of Tilburg

    The presentation gives an overview of the field of nonparametrics, shows the positions of the other five presentations within this field, and indicates other "hot" research areas in nonparametrics.

    Bertrand Melenberg is Associate Professor at the Econometrics department of Tilburg University.


    Empirical Bayes approaches to Wavelet Smoothing
    Bernard Silverman, Institute for Advanced Studies in the University of Bristol, England.

    One way of dealing with the notion that an unknown function has an economical wavelet expansion is to model the wavelet expansion with a suitable mixture prior distribution. The parameters in this distribution can themselves be estimated from the data. Such a procedure has very attractive theoretical properties and performs excellently in practice. Applications both to standard wavelet regression and to spatial smoothing from irregular data will be considered and discussed.

    Bernard Silverman is Professor of Statistics and Provost of the Institute for Advanced Studies in the University of Bristol, England. He gained his PhD in 1978 from Cambridge University, and has held academic posts at Oxford and Bath Universities, as well as visiting positions at various universities, especially Stanford, where he is a frequent visitor. He is the author of 5 books and about 100 articles in learned journals. His primary research interest is in smoothing methods in statistics, but he has worked in a wide variety of fields, ranging from probability theory to applications of statistics in social science, industry and medicine. He is a Fellow of the Royal Society and is currently President of the Institute of Mathematical Statistics.


    Nonparametric tests for monotonicity of a regression mean
    Irène Gijbels, Institute of Statistics, Catholic University, Louvain.

    The potential monotonicity of a response variable in relation to a covariate is often of significant practical interest. For example, econometric theory predicts that production cost is a nondecreasing function of production output, and a monotone link between the levels of two medical symptoms is an important indicator of a common cause.
    In this talk we discuss recently proposed procedures for testing for monotonicity of a regression function. Among the possible approaches we mention a test based on the size of a "critical" bandwidth (the amount of smoothing necessary to force a nonparametric regression estimate to be monotone), tests based on running gradients, and tests based on signs of differences. The first approach leads to a test that is analogous to Silverman's test of multimodality in density estimation. Bootstrapping is used to provide a null distribution for the test statistic. We give some examples to demonstrate the testing procedure. This test suffers from some difficulties in certain situations, and the running gradient approach has been proposed to get better power properties in these situations. The tests based on signs of differences have a guaranteed level and are quite robust against heavy-tailed error distributions. Some simulation results illustrate the performance of these testing procedures.

    Irène Gijbels is Professor of Statistics at the Institute of Statistics of the Catholic University of Louvain, Louvain-la-Neuve, Belgium. After having obtained her PhD in 1990, she was Visiting Professor at the University of North Carolina at Chapel Hill and held a research position at the National Science Foundation. After having spent some time at the Mathematical Sciences Research Institute in Berkeley, California, she took up her current academic position at the Institute of Statistics. She wrote a book on local polynomial modelling with Prof. J. Fan of the University of North Carolina. Her main research interest is in nonparametric functional estimation, with emphasis on nonparametric regression, change point problems, deconvolution problems and hazard estimation. She is Associate Editor for two international statistics journals.


    Simultaneous Component Models with Smoothness Constraints of Multivariate Time Series of a Number of Subjects.
    Marieke Timmerman, Heymans Institute, University of Groningen

    A class of four types of Simultaneous Component Analyses (SCA) for modelling multivariate time series collected from more than one subject is discussed. Both intra-individual and inter-individual variability are covered in the models. The models are the SCA-P model, direct fitting PARAFAC2, and two newly proposed direct fitting variants of the INDSCAL model and the SUMPCA model. In each of the models, the multivariate time series of each subject is decomposed into a loading matrix, which is common to all subjects, and a series of subject-specific component scores. The four models can be ordered hierarchically from weakly to severely constrained, thus allowing for large to small interindividual differences in the model. The interpretation of the components is based on the loading matrix. The component score series reveal the latent data structure in the course of time. To improve estimation of the structural part of the data, and interpretability of the model, one may impose smoothness constraints on the component score series. The use of B-splines to constrain the models will be discussed, and it will be shown that smoothing the data before performing an unconstrained SCA leads to the same estimates as performing a constrained SCA. The use of the models is illustrated by an empirical example.

    Marieke Timmerman (1972) studied psychology at the University of Groningen. After the general program, her studies were focussed on neuropsychology and methodology. This led to Master's theses on "Slowness of information processing after a closed head injury" in 1995 and on "Missing data" in 1996. Since 1996, she has been working on developments and applications of component models for multivariate multi-subject longitudinal data at the Heymans Institute of the University of Groningen under the supervision of Henk A.L. Kiers and Jos M.F. ten Berge. Besides her Ph.D. work, she is involved in teaching introductory statistics courses, and in statistical consulting.


    Nonparametric modeling of the anchoring effect in an unfolding bracket design
    Arthur van Soest, Tilburg University

    Household surveys are often plagued by item non-response on economic variables of interest like income, savings or the amount of wealth. Various papers by Manski show how, in the presence of such non-response, bounds on conditional quantiles of the variable of interest can be derived, allowing for any type of non-random response behaviour. Including follow-up categorical questions in the form of unfolding brackets for initial item non-respondents is an effective way to reduce complete item non-response. Recent evidence, however, suggests that such a design is vulnerable to a psychometric bias known as the anchoring effect. In this paper, we extend the approach by Manski to take account of the information provided by the bracket respondents. We derive bounds that do and do not allow for the anchoring effect. These bounds are applied to earnings in the 1996 wave of the Health and Retirement Survey (HRS). The results show that the categorical questions can be useful to increase precision of the bounds, even if anchoring is allowed for.
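
    To make the idea of worst-case bounds under non-response concrete, here is a minimal sketch. It bounds the distribution function P(Y <= t) rather than the conditional quantiles treated in the paper, and it ignores anchoring entirely. The function name, the data and the bracket limits are hypothetical illustrations, not taken from the paper.

```python
def cdf_bounds(t, observed, brackets, n_missing):
    """Worst-case bounds on P(Y <= t) when some values are unknown.

    observed  : exact values reported by full respondents
    brackets  : (low, high) intervals reported by bracket respondents
    n_missing : number of respondents who gave no information at all
    """
    n = len(observed) + len(brackets) + n_missing
    surely_below = sum(1 for y in observed if y <= t)
    # Lower bound: count a bracket only if it lies entirely at or below t.
    lower = surely_below + sum(1 for low, high in brackets if high <= t)
    # Upper bound: count every bracket that could contain a value <= t,
    # and assume all complete non-respondents fall at or below t.
    upper = surely_below + sum(1 for low, high in brackets if low <= t) + n_missing
    return lower / n, upper / n


# Hypothetical data: 4 exact answers, 2 bracket answers, 2 complete refusals.
observed = [10, 20, 30, 40]
brackets = [(0, 25), (25, 50)]

with_brackets = cdf_bounds(24, observed, brackets, n_missing=2)
# Same sample, but treating the bracket respondents as complete refusals.
without_brackets = cdf_bounds(24, observed, [], n_missing=4)
print(with_brackets, without_brackets)
```

    In this toy sample the bracket answers narrow the interval for P(Y <= 24), which is the sense in which categorical follow-up questions can add precision to the bounds.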

    Arthur van Soest has been Professor in Econometrics at Tilburg University, the Netherlands, since 1995. Previously he was Assistant Professor and Associate Professor at the same institute. From 1992 until 1997, he had a research fellowship of the Netherlands Royal Academy of Arts and Sciences. He has a Ph.D. in Econometrics from Tilburg University, where his advisors were Arie Kapteyn and Peter Kooreman. His research interests cover microeconometrics (limited dependent variable models, panel data, semi- and nonparametrics), labor economics (participation and labour supply, formal versus informal sector employment, wage structures), consumption and saving behaviour (income expectations, demand systems, portfolio choice), and economic psychology (risk aversion, time preferences, anchoring, non-expected-utility models). He has published papers on labour supply, formal and informal sector employment, wage differentials, income expectations, consumer demand, portfolio choice, etc. in various journals, including the Journal of Econometrics, Journal of Applied Econometrics, European Economic Review, The Review of Economics and Statistics, Journal of Economic Behaviour and Organization, Journal of the American Statistical Association, Labour Economics, and Journal of Human Resources.


    A consumer-theory-consistent semiparametric estimator of Engel Curves
    Bas Donkers, Erasmus University Rotterdam, Department of Marketing and Organization.

    This paper considers the application of semiparametric methods, in particular multi-index models, in the empirical analysis of consumer demand. The multi-index model is used to estimate Engel Curve relationships in rural China. To avoid incoherence with consumer theory, the impacts of total expenditures and of household composition on the expenditure share are modeled with two separate indices. This is shown to provide a useful way for analyzing consumption patterns of households with a different demographic composition.

    Bas Donkers is currently working at the department of Marketing and Organization of Erasmus University, Rotterdam. His research interests are in the econometric modeling of individual behavior, based on economic or psychological theory.




    VOC Spring Meeting 2000

    Classification with mixture models

    Senaatszaal, Academiegebouw

    Broerstraat 5, Groningen

    Friday 7 April 2000




     

    Classification with mixture models

    The spring meeting was organised jointly with SOM (Graduate School/ Research Institute, System, Organization, & Management). The meeting dealt with classification with mixture models, a theme that, given its popularity in recent years, has nevertheless received rather little attention within the VOC. The Groningen organisers managed to secure a number of renowned and interesting speakers. The meeting also included some new elements: the coffee and tea breaks were enlivened with demonstrations of three programs for classification with mixture models, and books on mixture models were available for browsing. The meeting was supported by two software vendors: S-Plus and ProGamma.

     


     

    Programme


     

    Recent developments in mixtures
    D. Mike Titterington (Department of Statistics, University of Glasgow, Glasgow, Scotland)

    The talk will present a wide review of recent research on mixture distributions and generalisations thereof. After the underlying framework has been established, recent progress on some of the most important current issues will be outlined, in both frequentist and Bayesian inference. For example, various recent approaches will be described to the question of assessing how many components are present in an underlying mixture distribution. Mention will also be made of generalisations of the mixture model, including hidden Markov chains, hidden Markov random fields and the hierarchical mixtures of experts model. The extent to which inference becomes more complicated with these models will be discussed. This will lead on to a discussion of a body of material in the neural-computing literature, where mixture-type models and their analysis have attracted recent attention and some new ideas have been created.

    Some relevant references

    • Hobert, J.P., Robert, C.P., and Titterington, D.M. (1999). On perfect simulation for some mixtures of distributions. Statist. Comp., 9, 287-298.
    • Jordan, M.I. and Jacobs, R.A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181-214.
    • Robert, C.P., Rydén, T., and Titterington, D.M. (2000). Bayesian inference in hidden Markov models through the reversible jump Markov chain Monte Carlo method. J. R. Statist. Soc. B, 62, 57-75.
    • Rydén, T. and Titterington, D.M. (1998). Computational Bayesian analysis of hidden Markov models. J. Comp. Graph. Statist., 7, 194-211.
    • Titterington, D.M. (1990). Some recent research in the analysis of mixture distributions. Statistics, 21, 619-641.
    • Titterington, D.M. (1996). Mixture distributions (update). In: S. Kotz, C.B. Read, and D. Banks (Eds.), Encyclopedia of Statistical Science Update (Volume 1, pp.399-407). New York: Wiley.
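
    As a small illustration of the basic framework reviewed in this talk, the sketch below fits a two-component univariate normal mixture with the EM algorithm and classifies each observation by its largest posterior probability. It is a teaching example under simple assumptions (two components, simulated data, a crude starting point), not a method taken from the talk; all names are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_component_mixture(data, n_iter=200):
    """EM for a two-component normal mixture; returns weights, means, sds, posteriors."""
    # Crude start: means at the extremes, common sd, equal weights.
    overall_mean = sum(data) / len(data)
    overall_sd = math.sqrt(sum((x - overall_mean) ** 2 for x in data) / len(data))
    weights, means, sds = [0.5, 0.5], [min(data), max(data)], [overall_sd, overall_sd]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each observation.
        resp = []
        for x in data:
            dens = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: weighted updates of mixing weights, means and sds.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            sds[k] = max(math.sqrt(var), 1e-6)  # guard against a collapsing component
    return weights, means, sds, resp


# Simulated data: two well-separated groups (an assumption made for the demo).
random.seed(0)
data = [random.gauss(-2.0, 1.0) for _ in range(200)] + [random.gauss(3.0, 1.0) for _ in range(200)]
weights, means, sds, resp = fit_two_component_mixture(data)
labels = [0 if r[0] >= r[1] else 1 for r in resp]  # classify by largest posterior
print(weights, means, sds)
```

    The posterior probabilities computed in the E-step are what turn a fitted mixture into a classification rule. Choosing the number of components, one of the issues mentioned in the abstract, is not addressed by this sketch.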

     

    Finite mixture modeling of developmental processes: theory and applications
    Han van der Maas (Afdeling Ontwikkelingspsychologie, Universiteit van Amsterdam)

    In developmental psychology typologies play an important role. In solving cognitive tasks, children are supposed to use different strategies in different phases of their development. Traditionally, classification of children in strategies or developmental stages occurred in ad hoc and subjective ways. The availability of (software for) finite mixture models makes it possible to fundamentally improve the research practice in developmental psychology. The application of finite mixture modeling to developmental tasks has only just begun and meets with a number of typical difficulties, regarding lack of specific theory, small sample sizes, and unreliable measurements. Yet, important progress has been made, both theoretically and technically. In this talk a theoretical justification of the use of typologies in developmental psychology will be discussed. This justification is based on the use of nonlinear dynamic models for the phase transitions that occur in stage-wise development. This will be illustrated with some successful applications of latent class modeling and multivariate mixture modeling of two famous cognitive developmental tasks, conservation and the balance scale.

     

    Why mixture models love genetics
    Ritsert Jansen (Centrum voor Biometrie, Wageningen)

     

    Bayesian selection and testing of latent class models
    Hans Berkhof (Departement Psychologie, Universiteit van Leuven)

    Specifying a latent class model involves the selection of a model from a set of alternatives. We discuss the Bayes factor as a selection tool and compare it to other selection tools including the deviance information criterion. Attention will be given to computational aspects as well as to prior sensitivity. Next, we consider goodness-of-fit testing using a posterior predictive checking procedure. An attractive feature of posterior predictive checking is that any function of the data and the model parameters can be used as a test quantity. This allows us to define test quantities that are most relevant from a substantive viewpoint. Posterior predictive checking is sometimes criticized for rendering conservative test results; we conclude by presenting a method for reducing the conservatism of the test.

    Johannes Berkhof graduated from the Department of Econometrics, University of Groningen in 1993. In the period 1993-1997, he worked as a PhD student at the Department of Statistics and Measurement Theory. He will defend his PhD thesis entitled "Specification methods for the multilevel model" in May 2000. Since 1998, he has been a researcher at the Department of Psychology, University of Leuven, working on posterior predictive checking procedures. His main research interests are Bayesian statistics, mixed-effects models, and nonparametric regression.

     

    Prediction and classification when diagnostic classes are related
    Emmanuel Lesaffre (Biostatistical Centre for Clinical Trials, U.Z. St. Rafaël, B-3000 Leuven, Belgium), Geert Molenberghs (Centrum voor Statistiek, Limburgs Universitair Centrum, Diepenbeek), Ilse Schey (Biostatistical Centre for Clinical Trials, U.Z. St. Rafaël, B-3000 Leuven, Belgium)

    We consider prediction and classification into diagnostic classes which consist of individuals who can suffer from multiple diseases. For instance, in a cardiovascular context a patient can need bypass surgery, or a valve replacement, or both. The popular multigroup logistic model is suitable for prediction into nominal classes, but does not employ the underlying structure of the classes. Hence, this model is not entirely suitable for this situation. Also, computational difficulties often occur with the multigroup logistic model when the classes are of the above nature. A modified form of the model, applicable to some economic applications, is not appropriate for most medical applications. Instead, we suggest the n-way Dale model, also called the marginal logistic model. It is shown that this model is computationally more stable, although more involved, and allows better interpretation of the parameters. To illustrate our ideas the POPS data set is taken, where the child's abilities at the age of 2 are predicted from risk factors at delivery. A simulation study is performed to indicate the gain in classification ability in comparison with the multigroup logistic model. It is also shown that in terms of the parameter estimates the Dale model is more sensitive to the choice of the sampling scheme than the multigroup logistic model.

    Keywords: Classification; Cross-ratio; Exclusiveness; Exhaustiveness; Marginal model; Multigroup logistic model; Prediction; Separate sampling

    Reference: Computational Statistics & Data Analysis, 25 (1997), 67-90

     

    Software demonstrations

    Latent Gold - Jeroen Vermunt (Vakgroep Methoden en Technieken, Katholieke Universiteit Brabant)

    Panmark - Frank van der Pol (CBS)

    GLIMMIX - Michel Wedel (Vakgroep Marktkunde en Marktonderzoek, Rijksuniversiteit Groningen)


     

    VOC Jubilee Congress 1999

    "Alles op zijn tijd" ("All in good time")

    Kerkrade, Congrescentrum Rolduc

    Thursday 18 - Friday 19 November 1999




    "Alles op zijn tijd"

    For the celebration of our society's tenth anniversary, the theme "Alles op zijn tijd" ("All in good time") was certainly fitting. Our jubilee congress addressed all kinds of aspects of ordination and classification of and with time-bound data. We were generous in our choice of topics and in the role that time plays in them. Three sub-themes characterised the three half-day sessions of the congress:

    - classification of time series;
    - "tinkering with curves";
    - reconstruction of an unknown time order.

    We describe them briefly below.
    The first sub-theme opened with an introduction by Tim Cole on modelling reference curves for the weights of growing children. Dré Nierop followed with a description of models for growth curves of relatively very small children. Growth curves are a form of repeated measurements on individuals. Geert Verbeke discussed how, in regression models for such data, subgroups can be detected and individuals assigned to them. But we do not only grow; we also return to dust. Causes of death have not always been recorded with the same classification system. Judith Wolleswinkel addressed the problem of fitting together time series with different classification schemes.

    Under the second sub-theme, several very different applications were covered. Jim Ramsay presented functional data analysis, in which curves (in time and space) and their rates of change are studied; where appropriate, the time axis is warped. Garmt Dijksterhuis and Paul Eilers built on this theme: they described how series of sensory measurements (smell and taste) can be modelled better by stretching or shrinking the time axis. Sabina Bijlsma works with fluorescence measurements, which yield a matrix of intensities at each time point. From a series of such matrices, the course of chemical reactions can be determined with three-way methods. Loe Boves dealt with automatic recognition of speech and speakers, which involves an interplay of pitch and intensity that change over time.

    The third sub-theme concerned reconstructing the order, and where possible the timing, of events from characteristics of objects or organisms. Jeroen Poblome described this, from an archaeological context, for the seriation of artefacts. He made data available to Patrick Groenen, who discussed methods and results. Phylogenetic trees represent evolutionary branchings in biological groups. The absolute time scale usually cannot be recovered, but the order of the branchings can, which makes it possible to say how many steps back one must go to find a common "ancestor". Rino Zandee described how this works.
    We believe that this varied programme was in the spirit of the VOC: attention to methods and to applications, with a broad view and an interdisciplinary outlook. It was an inspiring and convivial meeting.

    The jubilee committee

    Stef van Buuren
    Paul Eilers
    Caspar Looman
    Iven van Mechelen

     


     

    Programme, Jubilee Congress 1999

    Thursday 18 November

    • 13:00-13:45 Reception with coffee
    • 13:45-14:00 Welcome and opening

    Session I: Reference curves and classification of and with time series

    • 14:00-15:00 Tim Cole (Institute of Child Health, London)
    • 15:00-15:30 Tea break
    • 15:30-16:10 Dré Nierop (MUVARA, Leiderdorp)
    • 16:10-16:50 Judith Wolleswinkel (iMGZ, Erasmus Universiteit Rotterdam)
    • 16:50-17:30 Geert Verbeke (Biostatistical Centre, Katholieke Universiteit Leuven)
    • 17:30-18:30 Happy hour in the bar
    • 19:00-20:30 Dinner buffet

    Friday 19 November

    Session II: Tinkering with curves

    • 9:00-10:00 Jim Ramsay (Psychology Department & Department of Mathematics and Statistics, McGill University, Montreal, Canada)
    • 10:00-10:30 Coffee break
    • 10:30-11:10 Garmt Dijksterhuis (Department of Dairy and Food Science, Royal Veterinary and Agricultural University, Denmark) and Paul Eilers (Medical Statistics Department, Leiden University Medical Centre)
    • 11:10-11:50 Sabina Bijlsma (Process Analysis & Chemometrics research group, University of Amsterdam)
    • 11:50-12:30 Loe Boves (Katholieke Universiteit Nijmegen)
    • 12:30-13:30 Lunch

    Session III: Reconstruction of an unknown time order

    • 13:30-14:10 Jeroen Poblome (Fonds voor Wetenschappelijk Onderzoek - Vlaanderen)
    • 14:10-14:50 Patrick Groenen (Data Theory Group, Department of Education, Leiden University)
    • 14:50-15:30 Rino Zandee (sectie Theoretische Biologie en Phylogenetische Systematiek, Instituut voor Evolutionaire en Ecologische Wetenschappen, Universiteit Leiden)
    • 15:30-16:00 Farewell coffee or tea


    Abstracts, Jubilee Congress 1999

    Classification of infant weight gain using weight charts
    Tim Cole (Institute of Child Health, London)

    Human growth is at its most rapid in infancy, and weight faltering can have a lasting effect on future health. So identifying weight faltering is a priority at this age. Yet the weight charts conventionally used to measure growth do not measure growth as such, only attained weight. The talk will show how to assess weight gain on the weight chart, using an approach that involves converting weight to a standard deviation score (SDS) adjusted for skewness in the weight distribution. The assessment of the change in weight SDS between two ages then depends only on the correlation between weight SDS at the two ages. The talk will discuss some of the problems of interpretation that arise in classifying weight charts of individual children.
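
    The abstract gives no formulas. As a hedged sketch, the code below uses the standard published form of the LMS standard deviation score (Box-Cox power L, median M, coefficient of variation S) and the usual conditional-gain score, which depends on the two SDS values only through their correlation r. That the talk used exactly these forms is an assumption, and the numerical values are hypothetical, not taken from any reference chart.

```python
import math


def lms_sds(y, L, M, S):
    """Standard deviation score of measurement y under the LMS method.

    L is the Box-Cox power that removes skewness, M the median and
    S the coefficient of variation at the child's age.
    """
    if abs(L) < 1e-12:
        return math.log(y / M) / S  # limiting case L = 0
    return ((y / M) ** L - 1.0) / (L * S)


def conditional_gain_sds(z1, z2, r):
    """Weight gain as an SDS, conditional on the earlier weight SDS.

    z1, z2 are the weight SDS at the two ages and r is their correlation.
    The expected later SDS is r * z1 (regression to the mean), and the
    residual SD around that expectation is sqrt(1 - r^2).
    """
    return (z2 - r * z1) / math.sqrt(1.0 - r * r)


# Hypothetical LMS values for one age, for illustration only.
z = lms_sds(6.8, L=-0.2, M=7.5, S=0.12)
# A child moving from SDS 0 to SDS -1 when the two ages correlate at 0.8.
gain = conditional_gain_sds(0.0, -1.0, 0.8)
print(round(z, 2), round(gain, 2))
```

    With a correlation of 0.8, a drop of one SDS corresponds to a conditional gain of about -1.67, so the same change on the chart is more unusual the more strongly the two ages are correlated.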

    Tim Cole is professor of medical statistics in the Institute of Child Health in London. His publication list includes over 300 references in statistics, growth and nutrition. Cole is the inventor of the LMS-method, a widely used technique for constructing growth reference curves, the Growth Slide Ruler, a patented measurement device to assess growth in children, and the 3-in-1 reference chart. Cole is a member of several editorial boards, including Statistics in Medicine and British Medical Journal, and is a member of the Council of the Royal Statistical Society and the Steering Committee of the WHO Multicentre Growth Reference.


    On the development of a growth-response model for relatively very small children
    Dré Nierop (MUVARA, Leiderdorp)

    The International Pediatric Growth Research Centre in Göteborg, Sweden, is a research centre where children who lag far behind in physical growth are observed and treated. These children receive growth hormone treatment that can take years. Before treatment, all kinds of data are collected on these children, which should give an advance picture of how a child will respond to the treatment. With this information, the dosage of the medication could be tailored precisely to the child. However, it proved difficult to extract a clear line from the research results regarding the tuning of the dosage. Over the past 5 years, a growth-response model has been developed in collaboration with the medical specialists. The lecture will describe this development process. First, the joint modelling of all individual response curves is discussed. The resulting response parameters give a fairly accurate description of the growth increase during the first 4 years of growth hormone treatment. Next, an attempt is made to predict the response parameters from data available before treatment. For this purpose, a variety of non-linear techniques have been developed and applied.

    Dré Nierop graduated as a biologist. Since 1976 he has been engaged in the application and development of new multivariate statistical methods. He has worked in this field for many years on behalf of a large number of companies and institutions, including the Max Planck Institute, the Nederlandse Organisatie voor Wetenschappelijk Onderzoek, the Praeventiefonds, the Nederlands Instituut voor het Dove en Slechthorende Kind, the Nederlands Astmafonds, the Centrum voor Bio-Farmaceutische Wetenschappen and TNO Voeding. He obtained his PhD in 1993 at the vakgroep Datatheorie of the Rijksuniversiteit Leiden, Faculty of Social Sciences. Because he saw that the market had no shortage of statistical packages but did lack an adequate match between statistics and the substantive questions of companies and institutions, he struck out on his own in 1994 and started his company MUVARA (http://www.muvara.nl/).


    Reclassification of causes of death to analyse long-term mortality trends
    Judith Wolleswinkel (iMGZ, Erasmus Universiteit Rotterdam)

    The presentation describes a method for reclassifying causes of death in order to study cause-specific mortality in the Netherlands for the period 1875-1992. This reclassification is necessary because 10 different cause-of-death classifications were in use during 1875-1992, namely 9 versions of the International Classification of Diseases and Causes of Death and 1 earlier, 19th-century Dutch cause-of-death classification.

    The reclassification had to meet two criteria. First, we wanted to retain sufficient detail in the cause-of-death categories to be able to study mortality from both infectious and non-infectious diseases. Second, the categories created had to be nosologically continuous over time, i.e. the disease content of a category had to remain the same over time.

    We used a method designed by French demographers, which consisted of constructing 'two-sided correspondence tables' and 'fundamental associations'. The resulting categories were then tested for statistical continuity across the transition from one classification to the next.

    The result was a nested cause-of-death classification with 27 causes of death that could be studied for the entire period 1875-1992, 65 causes of death that could be studied for the period 1901-1992, and 92 causes of death that could be studied from 1931 onwards. It also offered sufficient detail on infectious and non-infectious diseases. The method proved very useful for reclassifying causes of death into nosologically continuous categories.

    Judith Wolleswinkel-van den Bosch studied Biomedical Sciences at the Rijksuniversiteit Leiden. She then joined the Instituut Maatschappelijke Gezondheidszorg (iMGZ) of the Erasmus Universiteit Rotterdam, where she worked on a project on 'the occurrence of diseases of the nervous system in the Netherlands'. At that time she was seconded to the CBS. She then began her PhD research on the 'epidemiological transition in the Netherlands', for which she received her PhD in November 1998. She still works at iMGZ, now on a project on 'the quality of perinatal care in the Netherlands'. She is registered as an epidemiologist A and has completed the MSc programme in epidemiology at NIHES.

    A linear mixed effects model with heterogeneity in the random effects population.
    Geert Verbeke (Biostatistical Centre, Katholieke Universiteit Leuven)

    A popular model for the analysis of longitudinal data is the linear mixed model. Using subject-specific regression coefficients (random effects), it is possible to explicitly model the belief that not all variation in subject-specific longitudinal profiles can be explained through covariates. Unfortunately, as will be shown in this presentation, the empirical Bayes estimates (EB), classically used to estimate the subject-specific regression coefficients, do not necessarily reflect the correct heterogeneity in the population, and can therefore not be used to detect sub-groups of patients which evolve differently over time, nor for the classification of patients in such clusters. To this end, the classical linear mixed model will be extended in order to explicitly allow the presence of heterogeneity in the random-effects population. This not only yields a formal testing procedure for the presence of heterogeneity, it also provides a formal rule for classifying subjects in any of the detected clusters. All of this is illustrated on two practical examples.
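The classification rule mentioned above can be illustrated with a deliberately simplified sketch. It assumes that each subject is summarised by a single number (for example an estimated slope) and that the population is a mixture of normal components with known proportions, means and standard deviations. All values below are made up for illustration, and the heterogeneity model in the talk works with the full longitudinal profile rather than one summary value.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_probabilities(x, components):
    """Posterior probability that x belongs to each mixture component.

    components: list of (proportion, mean, sd) tuples (hypothetical values).
    """
    weighted = [p * normal_pdf(x, m, s) for p, m, s in components]
    total = sum(weighted)
    return [w / total for w in weighted]


def classify(x, components):
    """Assign x to the component with the largest posterior probability."""
    post = posterior_probabilities(x, components)
    return max(range(len(post)), key=lambda j: post[j])


# Two hypothetical sub-groups: slow decliners (60%) and fast decliners (40%).
COMPONENTS = [(0.6, -0.5, 0.3), (0.4, -2.0, 0.5)]

if __name__ == "__main__":
    for slope in (-0.4, -1.1, -2.3):
        print(slope, classify(slope, COMPONENTS),
              posterior_probabilities(slope, COMPONENTS))
```

The posterior probabilities also show how clear-cut an assignment is: a subject lying between the two component means gets probabilities closer to one half.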

    Geert Verbeke is assistant professor at the biostatistical centre of the Katholieke Universiteit Leuven. He received the B.S. degree in mathematics (1989) from the Katholieke Universiteit Leuven, the M.S. in biostatistics (1992) from the Limburgs Universitair Centrum, and earned a Ph.D. in biostatistics (1995) from the Katholieke Universiteit Leuven. He wrote his dissertation, as well as a number of methodological articles, on various aspects of linear mixed models for longitudinal data analysis. He has held visiting positions at the Gerontology Research Center and the Johns Hopkins University (Baltimore, MD).

    Classifying handwriting samples using principal differential analysis.
    Jim Ramsay (Psychology Department, Department of Mathematics and Statistics, McGill University, Montreal, Canada)

    Perhaps the main distinguishing feature of functional data is the possibility of working with derivatives. Moreover, it may be more effective to use derivatives to describe and classify curves than to use the curve values themselves. Indeed, it is often the pattern of acceleration observed in the second derivative that carries the important information about what determines curve shape.

    Principal differential analysis (PDA) is a method for estimating a differential equation that describes both the mean and the variation in curve characteristics as a function of time. A differential equation is particularly useful in that it simultaneously models variation in the curves and a number of their derivatives.

    These two concepts are combined in the analysis of samples of handwriting dynamics by two individuals. Pen position in 3D as a function of time is analyzed using PDA, and the estimated differential equation not only captures the essentials of the dynamics of the scripts, but also discriminates cleanly between those produced by different persons.
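As a minimal illustration of estimating a differential equation from curve data, the sketch below fits the single coefficient w in the second-order equation x''(t) + w x(t) = 0 by least squares, using finite-difference estimates of the second derivative. This is a toy under strong assumptions (one noise-free curve, a constant coefficient, a hypothetical function name). PDA as described above works with functional data and is not limited to this simple form.

```python
import math


def estimate_constant_weight(x, h):
    """Least-squares estimate of w in x''(t) + w * x(t) = 0.

    x: curve values sampled on an equally spaced grid with step h.
    The second derivative is approximated by central differences.
    """
    num = 0.0
    den = 0.0
    for i in range(1, len(x) - 1):
        d2 = (x[i - 1] - 2.0 * x[i] + x[i + 1]) / (h * h)
        # Minimising sum (d2 + w * x_i)^2 over w gives w = -sum(d2 * x) / sum(x^2).
        num += d2 * x[i]
        den += x[i] * x[i]
    return -num / den


if __name__ == "__main__":
    omega, h = 2.0, 0.01
    curve = [math.sin(omega * i * h) for i in range(500)]
    print(estimate_constant_weight(curve, h))  # close to omega ** 2
```

For a pure sine wave the estimate recovers the squared angular frequency up to the finite-difference error, which is the sense in which the fitted equation summarises the curve's acceleration pattern.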

    Jim Ramsay is professor at the Psychology Department, and associate member of the Department of Mathematics and Statistics, at McGill University in Montreal, Canada. He teaches multivariate statistics and psychometrics. He has done (and still does) a great deal of research in these fields, leading to numerous articles in the scientific literature. He is active as an editorial and business consultant. He is a member of many professional organisations and has served on several boards. He has received many academic honours and awards. At present he is very active in smoothing methods and the analysis of curves, images and other types of functions, as published in the book "Functional Data Analysis", which he wrote together with Bernard Silverman.

    Modeling Time-Intensity data in dynamic sensory research.
    Garmt Dijksterhuis (Department of Dairy and Food Science, Sensory Science section, Royal Veterinary and Agricultural University, Denmark) and Paul Eilers (Medical Statistics Department, Leiden University Medical Centre)

    Sensory research on food and drinks investigates people's responses to aspects such as smell, taste and structure. Many measurements are dynamic: panelists indicate the strength of the sensation with a computer mouse or a sliding knob. The result is called a Time-Intensity (T-I) curve.

    Standard methods of multivariate analysis do not work well with T-I curves, because they do not exploit the dynamic character of the data. We describe two models for T-I curves. One is non-parametric: a prototype curve, built from B-splines, is estimated; its shape is such that it fits best to the data, after shrinking or expanding along the time and intensity axes. The second model is based on a mechanistic model of sensory processes. The parameters are estimated with non-linear least squares regression.

    Garmt Dijksterhuis is a sensory psychologist and methodologist. He studied psychology of perception at the University of Utrecht and wrote his Ph.D. dissertation at the department of Data Theory at the University of Leiden, in the Netherlands. Garmt is one of the founders of the sensometrics meetings and sensometrics society, and he takes part in the scientific and organizing committees of these and other conferences. He has written or co-authored over sixty publications. Garmt taught courses in sensory science and related topics in various countries throughout the EU and has been a guest scientist at several universities and research institutes. Currently he is employed as an associate professor at the Department of Dairy and Food Science, Sensory Science section, of the Royal Veterinary and Agricultural University of Denmark. His main research interests at the moment are theories from (perception-) psychology and their relevance for the perception and appreciation of food.

    Paul Eilers is a statistician at the Medical Statistics Department of the Leiden University Medical Centre. He recently moved there after a career in the management of computing departments at other institutes. He studied electronics; many years ago he took up statistics as a hobby, but that got a little out of hand... He is very interested in exploratory data analysis, smoothing, time series analysis and statistical computing.

    Estimating reaction rate constants from a two-step reaction: a comparison between two-way and three-way methods.
    Sabina Bijlsma (Process Analysis & Chemometrics research group, University of Amsterdam)

    The literature offers several methods for estimating reaction rate constants from spectral data of chemical reactions. If reaction rate constants have to be estimated on-line, fast methods are preferable. Two-way methods such as curve resolution are very popular for estimating reaction rate constants, because the parameters of interest can be incorporated as unknowns. Specific (kinetic) model information can be combined with curve resolution by means of different constraints, resulting in modified curve resolution methods. These modified methods always have an iterative character, so the exact time needed to obtain optimal estimates of the reaction rate constants is not known in advance. This makes the methods less suitable for on-line use. Three-way methods such as the generalized rank annihilation method (GRAM) and trilinear decomposition (TLD) can be used to estimate reaction rate constants in cases where the contribution of the different species to the mixture spectra is of an exponential nature. GRAM and TLD are both non-iterative, and the exact speed of the algorithms is known in advance. This makes the algorithms suitable for on-line monitoring of reaction rate constants and for process control. Unfortunately, GRAM and TLD may give only rough estimates of the reaction rate constants because the algorithms are not least-squares methods. The results from GRAM or TLD can be used as starting values for an iterative algorithm that combines the Levenberg-Marquardt algorithm with alternating least squares steps of the PARAFAC model. A comparison of the performance of the two-way and three-way methods mentioned is presented.
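To show where the exponential contributions come from, the sketch below computes concentration profiles for a two-step reaction. It assumes the scheme A -> B -> C with first-order kinetics, rate constants k1 and k2 (k1 different from k2) and only A present at the start. The abstract does not spell out the reaction scheme, so these are illustrative assumptions, as are the numerical values.

```python
import math


def concentrations(t, k1, k2, a0=1.0):
    """Concentrations of A, B and C at time t for A -> B -> C (first order).

    Assumes k1 != k2 and that only A is present at t = 0.
    """
    a = a0 * math.exp(-k1 * t)
    b = a0 * k1 / (k2 - k1) * (math.exp(-k1 * t) - math.exp(-k2 * t))
    c = a0 - a - b  # mass balance
    return a, b, c


if __name__ == "__main__":
    for t in (0.0, 1.0, 5.0, 20.0):
        print(t, concentrations(t, k1=0.30, k2=0.05))
```

Each profile is a combination of exp(-k1 t), exp(-k2 t) and a constant, so the measured mixture spectra inherit that exponential structure under these assumptions.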

    Sabina Bijlsma studied for an MSc degree in Analytical Chemistry at the University of Amsterdam and graduated in 1996. She remained in Amsterdam in the Process Analysis & Chemometrics research group of Prof. Dr. Age Smilde, where she is working on her PhD. She hopes to finish it in June 2000.

    Automatic recognition of speech and speakers.
    Lou Boves (Katholieke Universiteit Nijmegen)

    Speech is a manifestation of language. Two aspects of a spoken utterance can be distinguished: (1) what was said and (2) who the speaker was. Automatic speech recognition aims to determine WHAT was said. Speaker recognition, by contrast, aims to establish the identity of the speaker, possibly independently of the content of the message.

    By far the most successful approach to automatic speech recognition comes directly from information theory: the 'information' (described as the words uttered one after another) is encoded in the (noisy) speech signal, so recognition is defined as a decoding task. Within this framework it is natural to build probabilistic models of words. In practice it is more convenient to build models of units smaller than a word, such as sounds or syllables, because these occur much more frequently. In addition, models are built of the frequencies with which words and sequences of words occur.
    The talk sets out the principles of probabilistic automatic speech recognition. Concrete applications of the technology are touched on in passing.
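The decoding view can be sketched as choosing the word sequence W that maximises P(signal | W) times P(W). The toy below uses made-up probabilities for two candidate transcriptions. Real recognisers search over very large hypothesis spaces using models of sub-word units, and every name and number here is hypothetical.

```python
import math

# Hypothetical acoustic scores: log P(signal | words) for each candidate.
ACOUSTIC_LOGPROB = {
    ("recognise", "speech"): math.log(0.020),
    ("wreck", "a", "nice", "beach"): math.log(0.025),
}

# Hypothetical language-model scores: log P(words).
LANGUAGE_LOGPROB = {
    ("recognise", "speech"): math.log(0.0010),
    ("wreck", "a", "nice", "beach"): math.log(0.0001),
}


def decode(candidates):
    """Return the candidate maximising log P(signal | W) + log P(W)."""
    return max(candidates,
               key=lambda w: ACOUSTIC_LOGPROB[w] + LANGUAGE_LOGPROB[w])


if __name__ == "__main__":
    print(decode(list(ACOUSTIC_LOGPROB)))
```

In this toy the second candidate fits the signal slightly better, but the model of word frequencies makes the first far more plausible, so the first wins overall.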

    Speaker recognition is essentially approached in the same way as speech recognition, except that probabilistic models are now built of the speech of specific speakers. Practice shows that these models become more powerful as more a priori information is available for assigning the small amount of speech available from individual speakers to sub-models. Practical applications of speaker recognition will also be discussed.

    Lou Boves studied Dutch Language and Literature at the KUN, where he worked as a research associate at the Institute of Phonetics. In 1984 he received his doctorate for the thesis "The phonetic basis of perceptual ratings of running speech". In the early 1980s his field of work expanded to include speech technology. He played a central role in several international collaborative projects, including POLYGLOT (an ESPRIT project). He works as a consultant on speech and language technology for a number of companies, such as KPN. Since 1997 he has been full professor of Language and Speech Technology at the KUN. He is part-time scientific director of the NWO Priority Programme on Language and Speech Technology, whose aims include the development of a telephone information system for public transport in the Netherlands.

    A statistical approach to archaeology. Case studies in seriation.
    Jeroen Poblome (Fonds voor Wetenschappelijk Onderzoek - Vlaanderen)

    This paper aims to situate the use of statistics within the discipline of archaeology. It briefly sets out the theoretical background to the introduction of statistical techniques in archaeology and how archaeologists deal with statistical data today. The focus is primarily on seriation.

    The discussion then concentrates on two concrete examples of seriation. The first application concerns the prehistory of the Egyptian Nile Valley: for a series of sites, an attempt is made to determine the chronological order on the basis of technological characteristics of flint production. The second illustrates how the dating of a newly discovered mass-production centre for ceramics in Roman Sagalassos (Turkey) is approached.

    Both examples clearly show how seriation and statistics can be applied in archaeology from different angles, and how in each case the specific nature of the archaeological material is of primary importance for interpreting the data.

    Jeroen Poblome is a Postdoctoral Researcher of the Fonds voor Wetenschappelijk Onderzoek - Vlaanderen. He is secretary of the FWO research community "ROCT-Roman Crafts and Trade". He specialises in the organisation of production and the trade mechanisms of potters and glassblowers in the late Hellenistic, Roman and late Roman eastern Mediterranean basin, drawing on field experience at Sagalassos in Turkey. His main publication is "Sagalassos Red Slip Ware. Typology and Chronology" (Studies in Eastern Mediterranean Archaeology 2), Turnhout-Brepols 1999.

    Ordination techniques for seriation in archaeology.
    Patrick Groenen (Data Theory Group, Department of Education, Leiden University)

    The main task of seriation in archaeology is establishing the unknown temporal order of archaeological findings. In this paper, a short overview is presented of multivariate ordination techniques that may be used for seriation. Special attention will be paid to how (multiple) correspondence analysis and unidimensional scaling can be used for seriation. These techniques will be applied to real archaeological data that are also discussed by Jeroen Poblome.

    Patrick Groenen studied Psychology at Leiden University. He wrote his dissertation on several technical aspects of the majorization method for multidimensional scaling. At the Department of Data Theory, he was a member of the team that converted the Gifi-programs into SPSS Categories. He was a researcher in the NWO-funded Pioneer project 'Subject Oriented Multivariate Analysis' of Jacqueline Meulman. Currently, he is assistant professor at the Data Theory Group, Department of Education, Leiden University. He has written several articles in the area of multivariate analysis, multidimensional scaling, global optimization, clustering, and majorization. In 1997, the textbook 'Modern Multidimensional Scaling' appeared, which he co-authored with Ingwer Borg.

    Construction of phylogenetic trees from biological measurements.
    Rino Zandee (sectie Theoretische Biologie en Phylogenetische Systematiek, Instituut voor Evolutionaire en Ecologische Wetenschappen, Universiteit Leiden)

    The aim of a phylogenetic analysis is to formulate a hypothesis about the evolutionary relationships among the species in a biological group (for example a genus or a family). That hypothesis is based on generalised observations of groups of individual organisms belonging to biological species. The observations concern distinguishing characters of all kinds: morphological, anatomical, molecular. A range of techniques exists for extracting a phylogenetic relationship scheme from such data. Roughly, these techniques fall into two groups: maximum likelihood versus parsimony methods. Maximum likelihood methods rest on a model-based representation of the process of evolution. Parsimony methods are based much more on properties of the data as such. Some routes of analysis will be illustrated with practical examples, with the emphasis on parsimony methods. The application of the results of phylogeny reconstructions in other biological disciplines will also be discussed. Finally, brief attention will be paid to the possible applications of these methods in non-biological disciplines that (also) seek a historical explanation for patterns in data.
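To make the parsimony idea concrete, the sketch below counts the minimum number of character-state changes that a given tree requires for one character, using Fitch's small-parsimony algorithm. The choice of Fitch's algorithm, the nested-tuple tree encoding and the example data are assumptions made for illustration. The talk does not specify which parsimony procedures it covers.

```python
def fitch(node):
    """Return (possible_states, changes) for a rooted binary tree.

    A leaf is a string holding the observed character state;
    an internal node is a pair (left_subtree, right_subtree).
    """
    if isinstance(node, str):
        return {node}, 0
    left_states, left_cost = fitch(node[0])
    right_states, right_cost = fitch(node[1])
    common = left_states & right_states
    if common:
        # The children can share a state: no extra change is needed here.
        return common, left_cost + right_cost
    # The children disagree: one change is needed below this node.
    return left_states | right_states, left_cost + right_cost + 1


if __name__ == "__main__":
    tree_1 = (("A", "A"), ("C", "C"))  # groups equal states together
    tree_2 = (("A", "C"), ("A", "C"))  # mixes the states
    print(fitch(tree_1)[1], fitch(tree_2)[1])
```

A parsimony analysis would sum such counts over all characters and prefer the tree with the smallest total; here the first tree needs fewer changes than the second.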

    Rino Zandee is a lecturer (universitair docent) at Leiden University, working in the section Theoretische Biologie & Phylogenetische Systematiek of the Instituut voor Evolutionaire en Ecologische Wetenschappen. He studied biology in Leiden and received his doctorate in 1982 for a biosystematic study of two species complexes. In 1982 he switched from empirical to theoretical biology. Since then he has worked on conceptual and methodological aspects of reconstructing the evolutionary history and relationships of groups of biological species.

    VOC spring meeting 1999

    The spring meeting of the VOC was held at the Faculty of Social Sciences of Leiden University, in room SC01 (basement) of the Pieter de la Court building, on 26 March 1999.

    The spring meeting promised to be a special one. For this meeting we were able to invite five speakers from abroad. Four of them were asked to combine their attendance at an NWO expert meeting (on multivariate longitudinal data analysis) with giving a lecture for the VOC. Two of them spoke about their (very different) research in multivariate longitudinal data analysis. The other two spoke about other topics that occupy an important place in the VOC, namely multidimensional scaling and multiway analysis. The fifth speaker was the future IFCS president. As befits his future position, he addressed the importance for statistics of classification (based on geometric models). All in all, this made for a highly varied programme with internationally renowned speakers.

    Programme

    • 10.30-11.00 Reception with coffee
    • 11.00-11.45 Michael Browne (Ohio State University, Columbus, Ohio)
    • 11.45-12.30 Mohammed Bennani (Université Haute Bretagne, Rennes)
    • 12.30-13.45 Lunch break (make your own lunch arrangements, possibly in the canteen)
    • 13.45-14.30 Rasmus Bro (Royal Veterinary and Agricultural University, Copenhagen)
    • 14.30-15.15 André Carlier (Université Paul Sabatier, Toulouse)
    • 15.15-15.45 Tea and coffee break
    • 15.45-16.30 Jean-Paul Rasson (Université Notre-Dame de la Paix, Namur)
    • 16.30-16.50 Members' meeting
    • 16.50-18.00 Drinks

    Abstracts spring meeting 1999

    Multiplicative covariance structures for longitudinal studies
    Michael Browne (Ohio State University, Columbus, Ohio)

    Direct product models are suggested for situations where a battery of tests is administered repeatedly over a sequence of occasions to each of a sample of persons. After measurement error has been accounted for, a correlation coefficient between two tests on different occasions is regarded as the product of an inter-test correlation coefficient and an inter-occasion correlation coefficient.

    Additional structure may be imposed on the inter-test correlation matrix and on the inter-occasion correlation matrix. If the battery consists of indicators of a small number of factors, a factor analysis structure for the inter-test correlation matrix is plausible. A time-series structure for the inter-occasion correlation matrix may be used. This direct product correlation structure is generated by a factor analysis data model in which all common and specific factors follow a time series with the same parameters.

    A practical example of the application of the direct product between an ARMA time series structure and a factor analysis structure will be given.
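The product rule described in this abstract corresponds to a Kronecker (direct) product of the inter-occasion and inter-test correlation matrices. The sketch below builds such a matrix from made-up correlation values. Measurement error is ignored, and the ordering of the variables (occasion as the outer index, test as the inner index) is an assumption.

```python
def kronecker(a, b):
    """Kronecker product of two matrices given as lists of lists."""
    rows_b, cols_b = len(b), len(b[0])
    return [
        [a[i // rows_b][j // cols_b] * b[i % rows_b][j % cols_b]
         for j in range(len(a[0]) * cols_b)]
        for i in range(len(a) * rows_b)
    ]


# Hypothetical inter-occasion correlations (2 occasions).
OCCASIONS = [[1.0, 0.8],
             [0.8, 1.0]]

# Hypothetical inter-test correlations (3 tests).
TESTS = [[1.0, 0.5, 0.3],
         [0.5, 1.0, 0.4],
         [0.3, 0.4, 1.0]]

if __name__ == "__main__":
    full = kronecker(OCCASIONS, TESTS)
    # Row 0 is test 1 at occasion 1; column 4 is test 2 at occasion 2.
    print(full[0][4])  # product of 0.8 and 0.5
```

With 2 occasions and 3 tests the 6 x 6 correlation matrix is described by only four free correlations, which is what makes the structure parsimonious.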

    Michael Browne is a professor in the Department of Psychology and in the Department of Statistics at the Ohio State University. He received his Ph.D. in Statistics from the University of South Africa in 1969 and worked at the South African National Institute for Personnel Research and the University of South Africa prior to coming to the Ohio State University. Sabbatical leaves were spent as Visiting Fellow at Educational Testing Service, Visiting Scholar at UCLA and Hill Visiting Professor at the University of Minnesota. Michael Browne is a past President of the South African Statistical Association and the Psychometric Society and is currently President of the Society for Multivariate Experimental Psychology. His research interests are primarily in multivariate statistical models for psychological data. Computer programs implementing methodology he has developed are available at his web-site http://quantrm2.psy.ohio-state.edu/browne/.

    Global Minimization for metric multidimensional scaling by means of "Global continuation"
    Mohammed Bennani (Université Haute Bretagne, Rennes)

    Multidimensional scaling (MDS) is a collection of techniques and algorithms for fitting distance models to dissimilarity data. Two popular measures of fit have been proposed: STRESS and SSTRESS. The aim of this talk is to present a method for solving the global minimization problem for MDS. To avoid directly minimizing a "complicated" function, the method uses a special integral transformation to turn the original function into a class of gradually smoother functions with fewer local minima. A classical optimization procedure is then applied to these new functions in succession, tracing their solutions back to the original function. The method will be illustrated for unidimensional, city-block and Euclidean scaling.
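For reference, the two fit measures can be written down in a few lines. The sketch assumes the unweighted raw forms, in which STRESS sums (delta_ij - d_ij)^2 and SSTRESS sums (delta_ij^2 - d_ij^2)^2 over all pairs i < j, with Euclidean distances d_ij. The abstract does not state which normalisation or weighting the talk uses, and the data are made up.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def raw_stress(dissimilarities, config):
    """Sum over pairs i < j of (delta_ij - d_ij)^2."""
    return sum(
        (dissimilarities[i][j] - euclidean(config[i], config[j])) ** 2
        for i, j in combinations(range(len(config)), 2)
    )


def raw_sstress(dissimilarities, config):
    """Sum over pairs i < j of (delta_ij^2 - d_ij^2)^2."""
    return sum(
        (dissimilarities[i][j] ** 2 - euclidean(config[i], config[j]) ** 2) ** 2
        for i, j in combinations(range(len(config)), 2)
    )


if __name__ == "__main__":
    delta = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
    perfect = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]  # two-dimensional fit
    line = [(0.0,), (3.0,), (4.0,)]                 # unidimensional fit
    print(raw_stress(delta, perfect), raw_stress(delta, line))
    print(raw_sstress(delta, perfect), raw_sstress(delta, line))
```

Both functions are non-convex in the coordinates, which is why local minima arise and why a smoothing strategy such as the one in the talk is of interest.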

    Mohammed Bennani Dosse is "Maître de Conférences" at the university Haute Bretagne, Rennes, France. He received his Ph.D. in "Analyse des Données", October 1993. He wrote his dissertation on "Triadic distances". A large part of his dissertation was published in Journal of Mathematical Psychology, vol. 41, No. 2, 1997 (with Prof. W.J. Heiser). He published several papers on 3-way analysis and applications to sensory analysis. His current research concerns the development of global optimization methods in MDS.

    Using constraints in multiway modeling in chemistry
    Rasmus Bro (Royal Veterinary/Agricultural University, Copenhagen)

    Constraining a (multiway) model may sometimes be helpful. This especially holds in many modeling problems in chemistry. For example one may want to resolve spectra. To ensure that the estimated spectra make sense it may be reasonable to estimate the spectra under non-negativity constraints as most spectral parameters are known to be non-negative. In general, constraints can for example help to: obtain parameters that do not contradict a priori knowledge, obtain unique solutions where otherwise a non-unique model would be obtained, avoid degeneracy and numerical problems, speed up algorithms, enable quantitative analysis of qualitative data, prevent model misspecification from giving a misleading model.

    The use of constraints in multiway analysis is discussed primarily as a way of incorporating a priori knowledge into a model. Sometimes the usefulness of a constraint only becomes apparent during exploratory model building, from the outcome of an interim model. The focus in this talk is on what can be gained from incorporating constraints in this way. Illustrative examples are given from the field of flow injection analysis and from process analysis using fluorescence spectroscopy.

    Rasmus Bro is assistant professor at the Royal Veterinary and Agricultural University in Frederiksberg, Denmark. He received his Ph.D. (Cum Laude) in chemistry, November 1998, at the University of Amsterdam. Since 1995, he published many papers on multiway analysis in chemometrics, and is currently working on a book on this topic. His research is primarily concerned with developing new mathematical algorithms for multiway methods to be used in, for instance, the food industry. Specifically, he develops models for handling complicated data by allowing (possibly vague) a priori knowledge to be incorporated concisely in the model. Furthermore, he maintains an encyclopedia-like web-site http://newton.foodsci.kvl.dk/rasmus.html/ containing many links to other websites, a database with thousands of papers, and a toolbox and tutorial for multiway analysis.

    Distances between trajectories for longitudinal data
    André Carlier (Université Paul Sabatier, Toulouse)

    The analysis of multivariate longitudinal data can lead to the representation of units i (i=1,...,I) in the vector space R^p by a sequence of vectors x_i(t_k), k=1,...,K. In this representation, the coordinates of each vector x_i(t_k) are the observations of p variables on unit i at time t_k. In graphical displays of these data, a frequent approach is to link each pair of consecutive points x_i(t_k), x_i(t_{k+1}) by a line segment. The resulting polygons have been called "observed trajectories". For multivariate exploratory methods that are applied to data involving time but do not take the specific properties of time into account, this device is a way to include time as "a posteriori information" in the graphical results. It makes the graphic easier to interpret when the number of trajectories is small, but displaying more than a dozen of them leads to an unreadable graphic. For this reason, and also to help the interpretation of such data, different authors have suggested clustering methods for trajectories. The basis of such a clustering algorithm is the definition of a distance between trajectories.

    In this presentation, we review these distances between trajectories and compare their properties, both from a general point of view and on an example. In some cases the distances turn out to be only "semi-distances", because they do not satisfy all the properties of a distance. In a first step, we make a typology of distances according to the characteristics that they take into account. For example, in some cases we are more interested in the direction of the changes over time than in the locations of the vectors x_i(t_k) at each time. In a second step we consider synchronous versus asynchronous distances. For synchronous distances, trajectories are close if they describe the same curve at the same times in the space R^p. For asynchronous distances, two trajectories are close if they describe approximately the same curve in R^p, even if they do so at different times. Graphical displays obtained after clustering trajectories with different distances illustrate the different properties of these distances. To conclude, some conditions that ensure a relevant use of these distances are discussed.
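The contrast between synchronous and asynchronous distances can be made concrete with a small sketch. The two functions below are illustrative choices, not necessarily among the distances reviewed in the talk: the synchronous one averages point-to-point distances at equal time indices, and the asynchronous one is a symmetric mean closest-point distance that ignores when each point is visited. Both assume trajectories observed at the same number of time points.

```python
import math


def point_distance(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def synchronous_distance(traj_a, traj_b):
    """Mean distance between the two trajectories at the same time points."""
    return sum(point_distance(p, q) for p, q in zip(traj_a, traj_b)) / len(traj_a)


def asynchronous_distance(traj_a, traj_b):
    """Symmetric mean closest-point distance, ignoring visiting times."""
    def one_way(src, dst):
        return sum(min(point_distance(p, q) for q in dst) for p in src) / len(src)
    return 0.5 * (one_way(traj_a, traj_b) + one_way(traj_b, traj_a))


if __name__ == "__main__":
    a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 0.0)]
    b = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # same path, one step later
    print(synchronous_distance(a, b), asynchronous_distance(a, b))
```

Here the second trajectory follows the same path with a delay, so the synchronous distance is positive while the asynchronous one is zero. The asynchronous function also ignores the direction of travel, and it can be zero for distinct trajectories, which makes it a semi-distance at best.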

    André Carlier is assistant professor at Université Paul Sabatier in Toulouse, France. His research focuses mainly on multivariate statistics, 3-way data analysis and modeling. He developed a method for the analysis of a set of contingency tables observed over time (Longitudinal Analysis of Contingency Tables), and has worked on 3-way correspondence analysis. He is also concerned with discriminant analysis on categorical data and is a co-author of a book on this topic. As a specialist in S-PLUS, he developed a module of S-PLUS functions for multivariate data analysis (the module MULTIDIM) and for some 3-way methods. All his research is directed towards applications in different areas of statistics, such as pharmaceutical research (in which he supervised a Ph.D.), hail suppression programs and marketing.

    Geometrical tools in classification
    Jean-Paul Rasson (Université Notre-Dame de la Paix, Namur)

    Being at the frontier between Statistics and Classification, I have for a long time been fascinated by this quotation (J.A. Hartigan in Encyclopedia of Statistical Sciences): "I expect that theories of classification will do as much to clear up the murky foundations of probability and statistics as theories of probability and statistics will do to clear up the murky foundations of classification". In this spirit, I am planning, considering the problems of the estimation of a convex support in Statistical Geometry and, in Classification, the "convexity methods" (cfr H.H. Bock), to show the links between them. Precisely, I want to show how these disciplines can be tools for each other, bringing to the other, new ways of thinking and new justifications for some solutions. We will concentrate on convex bodies mainly and show how this extends to more general shapes.

    Jean-Paul Rasson is full professor and director of the Department of Mathematics at the "Facultés Universitaires Notre-Dame de la Paix" at Namur (Belgium). He prepared his Ph.D. in Paris (Université René Descartes) and at the University of Cambridge (U.K.) and received it in 1978. His research is mainly concerned with the geometrical aspects of Statistics, Pattern Recognition and Classification, with image analysis as his favourite field of application. He was President of the SFC (Société Francophone de Classification) from 1996 to 1998 and is now the Vice President and President-Elect of the IFCS.

    VOC autumn meeting 1998

    The autumn meeting of the VOC was held at the Ministry of Finance in The Hague on Friday 13 November 1998. The meeting consisted of five lectures on the role of ordination and classification in the financial world. The lectures ranged from the prediction of bankruptcy and credit risks to the description of the behaviour of interest rates and stock prices. The lectures were preceded by a word of welcome from the head of the Ministry's Research department, in which he briefly discussed the research carried out by his department and the use of ordination and classification techniques in it.

    Programme

    • 10.30-11.00 Reception with coffee
    • 11.00-11.15 Word of welcome by Sjoerd Peereboom (Research department, Ministry of Finance)
    • 11.15-11.50 Hans van 't Zand (Operations Research, ABN AMRO Bank)
    • 11.50-12.25 Paul Pompe (Financieel Management en Bedrijfseconomie, Universiteit Twente)
    • 12.25-13.40 Lunch break (lunch provided by the Ministry of Finance)
    • 13.40-14.15 Peter Schotman (Limburg Institute of Financial Economics, Universiteit Maastricht)
    • 14.15-14.50 Patrick Groenen (Data Theory Group, Leiden University) and Philip Hans Franses (Econometrisch Instituut, Erasmus Universiteit Rotterdam)
    • 14.50-15.15 Tea and coffee break
    • 15.15-16.15 Joe Whittaker (Department of Mathematics and Statistics, Lancaster University)
    • 16.15- Drinks (provided by the Ministry of Finance)


    Abstracts, autumn meeting 1998

    Credit rating at ABN AMRO
    Hans W. van 't Zand (Operations Research Department, ABN AMRO Bank)

    The introduction to this presentation explains the importance of credit rating and the different types of rating models used in different market segments: models that make use of stock prices, expert models and statistical models.

    The main part of the presentation will focus on two statistical credit rating models: an application scorecard for small businesses, and a revision scorecard for medium-sized companies. The design of the models, the data collection and preprocessing, the statistical techniques employed and the results will be presented. Special attention will be given to some specific problems, such as reject inference (the problem that our data are biased, because we have no data on rejected clients) and overfitting.

    Hans van 't Zand studied Applied Mathematics at the Universiteit Twente from 1971 through 1978, specializing in Operations Research. Since 1979 he has worked at ABN AMRO Bank, in the Operations Research division. He works there as a consultant and is also a member of the department's management team. Within the bank's head office, the Operations Research department acts as an internal consultancy for policy and decision support, using tools from statistics, mathematics and computer science.


    Bankruptcy prediction: a comparison between linear discriminant analysis and neural networks
    Paul P.M. Pompe (Department of Financial Management and Business Economics, Universiteit Twente)

    Over the past decades, much research has been devoted to developing models for predicting bankruptcies. Such a model usually aims to describe the relationship between an impending bankruptcy and a number of explanatory financial ratios. These ratios can be calculated from the information in a company's annual accounts. The ultimate goal is an instrument that signals a bankruptcy at an early stage. Bankruptcy models are often derived using the statistical method of linear discriminant analysis (LDA). In recent years, neural networks have also attracted attention.

    The presentation covers a study that compared the performance of LDA and neural networks in predicting bankruptcies. Both methods were applied to a data set of annual accounts from going-concern and bankrupt Belgian companies. LDA and neural networks turned out to perform equally well. Only when a model was derived from very little data did neural networks perform better.
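
    To make the LDA side of the comparison concrete, here is a minimal two-class linear discriminant on two financial ratios. It is a sketch only: the ratio values are invented for illustration and are not taken from the study, and the function names are hypothetical. The rule assigns a company to the going-concern class when its discriminant score exceeds the midpoint between the two class means.

```python
def fit_lda(class0, class1):
    """Fit a two-class linear discriminant on 2-D points.
    Returns (w, threshold); x is assigned to class 1 when w.x > threshold."""
    def mean(rows):
        n = len(rows)
        return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]

    def scatter(rows, m):
        # Sum of outer products of deviations from the class mean
        s = [[0.0, 0.0], [0.0, 0.0]]
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
        return s

    m0, m1 = mean(class0), mean(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    dof = len(class0) + len(class1) - 2
    # Pooled within-class covariance matrix and its 2x2 inverse
    c = [[(s0[i][j] + s1[i][j]) / dof for j in range(2)] for i in range(2)]
    det = c[0][0] * c[1][1] - c[0][1] * c[1][0]
    inv = [[c[1][1] / det, -c[0][1] / det], [-c[1][0] / det, c[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    mid = [(m0[0] + m1[0]) / 2, (m0[1] + m1[1]) / 2]
    threshold = w[0] * mid[0] + w[1] * mid[1]
    return w, threshold


def predict(w, threshold, x):
    return 1 if w[0] * x[0] + w[1] * x[1] > threshold else 0


# Invented (ratio_1, ratio_2) pairs: class 0 = bankrupt, class 1 = going concern
bankrupt = [(0.5, 0.1), (0.7, 0.2), (0.6, 0.05), (0.8, 0.15)]
going = [(1.5, 0.4), (1.8, 0.5), (1.6, 0.35), (2.0, 0.45)]
w, threshold = fit_lda(bankrupt, going)
```

    A neural network would replace the single linear score with a nonlinear function of the same ratios; the study's finding is that this extra flexibility rarely paid off on its data.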

    Paul Pompe studied Business Administration (1986-1993) and Computer Science (1988-1993) at the Universiteit Twente. Since 1993 he has worked, first as a research associate and subsequently as a PhD student (AiO), in the Department of Financial Management & Business Economics of the Faculty of Technology & Management, Universiteit Twente. His doctoral research is on bankruptcy prediction.


    Factor models for the term structure of interest rates
    Peter C. Schotman (Limburg Institute of Financial Economics, Universiteit Maastricht)

    The term structure of interest rates describes the relationship between the maturity of a contract and the level of the interest rate. On average, the short-term rate is lower than the long-term rate, and the short rate fluctuates more. Term structure models are used to value other, derivative, financial securities. An example is a mortgage contract with a given fixed-rate period, a rate consideration period (rentebedenktijd), entry rates (instaprentes) and sometimes many more provisions. Valuing such a product requires interest rate scenarios for the entire term structure. This is only tractable if the dynamics of rates with different maturities (ranging from one day to twenty years) can be summarized in a model with only a limited number of sources of randomness. Unfortunately, standard factor models or principal component analysis are not entirely suitable for this, because these purely statistical techniques ignore the financial-economic requirement that the interest rate scenarios be arbitrage-free. In the talk I will discuss the special type of factor model used for modelling the term structure.

    Peter Schotman (1960) studied econometrics from 1978 to 1984 at the Erasmus Universiteit Rotterdam (EUR). In 1989 he obtained his doctorate with the thesis Empirical Studies on the Behaviour of Interest Rates and Exchange Rates, supervised by Eduard Bomhoff and Teun Kloek. In 1991 he joined the Universiteit Maastricht, first as associate professor (UHD) in the finance section of the Faculty of Economics and Business Administration, and from 1994 as professor of Finance and Econometrics at the research institute LIFE (Limburg Institute of Financial Economics). His research interest is the econometrics of financial markets. In the summers of 1990 and 1993 he was a visitor at the Institute of Empirical Macroeconomics (University of Minnesota, Federal Reserve Bank Minneapolis), and in 1992 he was a visiting professor at the Woodrow Wilson School of Public and International Affairs of Princeton University. In the spring of both 1996 and 1997 he was a visiting professor at GREQAM (Groupement de Recherche en Économie Quantitative d'Aix-Marseille) in Marseille.


    Visualizing similarities across stock markets
    Patrick J.F. Groenen (Data Theory Group, Universiteit Leiden) and Philip Hans Franses (Econometric Institute, Erasmus Universiteit Rotterdam)

    There are various reasons for analyzing similarities across stock market returns and volatilities. Economic motivations include possible insights into diversification opportunities and specific features of emerging markets. A statistical motivation is to obtain ideas for multivariate model specification. We propose to visualize the correlation structure by multidimensional scaling, where this structure is allowed to vary over time. We illustrate our method on daily data (covering 1986 to 1995) for 13 country-specific stock markets using a dynamic computer display. One of our findings is that Asian stock markets tend to behave more similarly towards the end of the sample.
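
    One way to prepare input for such a time-varying display is to compute correlations in a moving window and turn each correlation r into a dissimilarity. The sketch below assumes the mapping d = sqrt(2(1 - r)) and a fixed window length; both are common choices but are assumptions here, since the abstract does not say which transformation or windowing the authors used. The market names and returns are invented.

```python
import math


def correlation(x, y):
    """Pearson correlation of two equally long return series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def dissimilarities(returns):
    """Map each pair of markets to d = sqrt(2 * (1 - r))."""
    names = sorted(returns)
    d = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = correlation(returns[a], returns[b])
            d[(a, b)] = math.sqrt(max(0.0, 2.0 * (1.0 - r)))
    return d


def rolling_dissimilarities(returns, window):
    """One dissimilarity table per window position, so the
    correlation structure is allowed to change over time."""
    length = len(next(iter(returns.values())))
    tables = []
    for start in range(length - window + 1):
        chunk = {k: v[start:start + window] for k, v in returns.items()}
        tables.append(dissimilarities(chunk))
    return tables


# Invented returns: B moves exactly with A, C moves exactly against A
returns = {
    "A": [0.01, -0.02, 0.015, 0.0, -0.01, 0.02],
    "B": [0.02, -0.04, 0.03, 0.0, -0.02, 0.04],
    "C": [-0.01, 0.02, -0.015, 0.0, 0.01, -0.02],
}
tables = rolling_dissimilarities(returns, window=4)
```

    Each table could then be passed to a multidimensional scaling routine, giving one configuration of markets per window.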

    Patrick Groenen studied Psychology at Leiden University. He wrote his dissertation on several technical aspects of the majorization method for multidimensional scaling. At the Department of Data Theory, he was a member of the team that converted the Gifi programs into SPSS Categories. Currently, he is a researcher in the NWO-funded Pioneer project 'Subject Oriented Multivariate Analysis' of Jacqueline Meulman. He has written several articles in the areas of multivariate analysis, multidimensional scaling, global optimization, clustering, and majorization. In 1997, the textbook 'Modern Multidimensional Scaling', which he co-authored with Ingwer Borg, appeared.

    Philip Hans Franses is Professor of Applied Econometrics at the Econometric Institute of the Erasmus Universiteit Rotterdam. His research mainly concerns the application of econometric methods and techniques to financial, marketing and macroeconomic problems. Much of his attention goes to the analysis of time series. He recently published a textbook on this subject, "Time Series Models for Business and Economic Forecasting", with Cambridge University Press.


    Graphical and other models for large tables occurring in credit scoring
    Joe Whittaker (Department of Mathematics and Statistics, Lancaster University)

    The practical problems of fitting graphical models to categorical data arise because the number of cells in the table climbs exponentially with the number of dimensions. The log-linear all-way interaction model has far too many parameters to be well estimated from realistic sample sizes. Certain strategies can be adopted to alleviate this problem. Firstly, interactions higher than two can be constrained to vanish, which still retains the conditional independence structure of the graphical model but reduces the number of parameters to a quadratic function of the dimension, as in Whittaker (1990). Secondly, summing the normalising constant over all cells during the iterative fitting procedure can be prohibitively time-consuming; replacing this sum by a Monte Carlo approximation following the idea of Geyer and Thompson (1992), together with importance sampling to mimic the realised data configuration, leads to practical improvements with only moderate loss of precision. Thirdly, single dimensions with large numbers of categories also lead to large numbers of parameters, and while these are not constrained they may satisfy certain intuitive smoothness criteria, for instance with ordinal data. In certain cases penalising the likelihood leads to more satisfactory and stable estimates. One way of assessing the fit of such models, that is, the trade-off between bias and complexity, is by using AIC, or the network information criterion, discussed in Ripley (1996).

    We use these techniques to analyse the applications of individuals for credit cards. Here the data is fairly abundant, but the number of questions is typically of the order of 30, with the number of categories for each response varying from binary to 50 or so age groups. The standard approach to such data is to either examine regression models for selected variables of interest, or to look at low dimensional marginal cross classifications, or to group adjacent categories, or to use a combination of these techniques. We show what improvements may be expected by employing a principled and optimised approach to the analysis.
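
    The scale of the problem can be checked with a few lines of arithmetic. The sketch below counts the cells of a contingency table, the free parameters of the all-way (saturated) log-linear model, and the free parameters when only main effects and two-way interactions are kept. The function names are illustrative, and the example tables (30 binary questions; 29 binary questions plus one with 50 categories) are chosen to match the orders of magnitude mentioned in the abstract, not taken from the actual data.

```python
from math import prod


def n_cells(levels):
    """Number of cells in the full cross-classification."""
    return prod(levels)


def saturated_params(levels):
    """Free parameters of the all-way interaction model."""
    return prod(levels) - 1


def pairwise_params(levels):
    """Free parameters with main effects and all two-way interactions only."""
    main = sum(c - 1 for c in levels)
    pairs = sum((levels[i] - 1) * (levels[j] - 1)
                for i in range(len(levels))
                for j in range(i + 1, len(levels)))
    return main + pairs


binary30 = [2] * 30
mixed = [2] * 29 + [50]

cells_binary = n_cells(binary30)             # 2**30 cells
pairwise_binary = pairwise_params(binary30)  # 30 main effects + 435 pairs
pairwise_mixed = pairwise_params(mixed)
```

    For 30 binary questions the table has over a billion cells, while the pairwise model needs only 465 parameters; a single 50-category variable raises that count to 1905, which illustrates the third point about dimensions with many categories.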

    Joe Whittaker is a Senior Lecturer in the Department of Mathematics and Statistics, Lancaster University. He has held various positions, such as member of Council of the Royal Statistical Society, member of the Research Section, chairperson of its Multivariate Study Group, elected member of the Board of Directors of the European Region of the International Association for Statistical Computing, and visiting professorships in the Department of Statistics at Colorado State University, the University of Chicago, and the Paul Sabatier University. Furthermore, he was co-organiser of the conference on Probabilistic Graphical Models, held at the Isaac Newton Institute, Cambridge, part of the Neural Networks and Machine Learning 1997 Program in which he was a Visiting Fellow.

    At the moment, he is a fellow of the Royal Statistical Society and the Program Chair for Uncertainty99, the Seventh International Workshop on Artificial Intelligence and Statistics, to be held in Florida 1999.

    His research interests lie in graphical modelling and transportation modelling. He is the author of over 60 articles in the research literature. He has written an internationally best-selling graduate text on graphical models, which has been reprinted six times since 1990. He has given numerous short courses on graphical models, some sponsored by the Economic and Social Research Council, U.K., in England, Holland, France, Italy, Brazil and the USA. He has co-organised three recent workshops on short-term traffic forecasting, in Lancaster and Delft; the contents of one formed a special issue of the International Journal of Forecasting, 1997.

    More recently he has taken an interest in credit scoring which has formed an interesting application area for ideas from graphical models.

    New graduate students are always very welcome.


    VOC spring meeting 1998

    The spring meeting of the VOC was held in room 3/4 of the Mathematics building (Wiskunde-gebouw) of the Landbouwuniversiteit Wageningen on Friday, 24 April 1998. The theme was the application and development of ordination and classification methods in agricultural and environmental research. The talks addressed both how problems in agricultural and environmental research have led to new ordination and classification methods and how existing techniques can be applied in new contexts. The talks were followed by the annual general meeting of members. The meeting closed with drinks.

    Programme

    • 10.30-11.00 Reception with coffee
    • 11.00-11.30 Pieter Kroonenberg (Education, RU Leiden)
    • 11.30-12.00 Hans Jansen and Annemarie de Jong (Centrum voor Biometrie Wageningen)
    • 12.00-12.30 Fred van Eeuwijk (Statistics, LU Wageningen)
    • 12.30-13.45 Lunch break
    • 13.45-14.15 Paul Eilers (DCMR Milieudienst Rijnmond, Schiedam) and Brian Marx (Louisiana State University)
    • 14.15-14.45 Cajo ter Braak (Centrum voor Biometrie Wageningen) and Paul van den Brink (DLO - Staring Centrum Wageningen)
    • 14.45-15.15 Tea and coffee break
    • 15.15-16.15 Jean-Baptiste Denis (Laboratoire de Biometrie, INRA Versailles) and Javier Moro-Serrano (CIFOR, INIA, Madrid)
    • 16.15-16.45 Annual general meeting
    • 16.45- Drinks


    Abstracts, spring meeting 1998

    Principal component analysis for quantitative and qualitative variables: an application to genotype x environment interaction
    Pieter M. Kroonenberg (Education, RU Leiden), B.D. Harch (CSIRO, Biometrics, Glen Osmond, SA, Australia), K.E. Basford (The University of Queensland, Agriculture, Brisbane, Qld, Australia), A. Cruickshank (J. Bjelke-Petersen Research Station, Kingaroy, Qld, Australia)

    The set of descriptors for the accessions of most germ-plasm collections consists of both numerical and categorical variables. This poses problems for a combined analysis of all descriptors as not many statistical techniques deal with mixtures of measurement types. In this paper, nonlinear principal component analysis was used to analyse the descriptors of the accessions in the Australian groundnut collection. It was demonstrated that the nonlinear variant of ordinary principal component analysis is an appropriate analytical tool in that subspecies and botanical varieties could be identified on the basis of the analysis and characterised in terms of all descriptors. Moreover, outlying accessions could be easily spotted and their characteristics established.

    Pieter Kroonenberg works as a statistician in the Department of Education (subfaculteit Pedagogische Wetenschappen), Rijksuniversiteit Leiden. He studied applied mathematics in Leiden and wrote his doctoral thesis on three-way component analysis at the Faculty of Social Sciences in Leiden. His main research interests are three-way analysis and the application of advanced statistical techniques to empirical data. He has published articles in both methodological and substantive journals across a large number of disciplines.


    Statistical analysis of the genetic structure of natural populations
    Hans Jansen and Annemarie de Jong (Centrum voor Biometrie, Wageningen)

    Natural populations consist of individuals with genetic characteristics that are anchored in their DNA. Molecular-biological techniques make it possible to reveal differences in the DNA of individuals. On the one hand this offers insight into the genetic variation within populations; on the other hand it makes it possible to compare different populations with respect to their genetic composition. This talk discusses various statistical methods for representing the genetic structure of populations.

    Hans Jansen is a senior researcher and project leader for Statistical Genetics, and works on statistical-genetic aspects of molecular markers in plant breeding.

    Annemarie de Jong is a researcher working on the statistical analysis of the genetic structure of natural populations.


    Statistical aspects of the use of molecular markers for distinguishing plant varieties
    Fred van Eeuwijk (Agricultural, Environmental and Systems Technology, LU Wageningen)

    To obtain protection for a plant variety newly developed through breeding, the new variety must differ significantly from varieties already recognized. Traditionally, experiments are carried out for this purpose in which candidate varieties are compared with recognized varieties, usually over two years. A procedure based on the t-test must decide whether or not the candidate varieties differ significantly. The characteristics on which comparisons are made are laid down for each crop and concern phenotypic characteristics, observable on the plant. These characteristics are sensitive to the environment and generally have limited resolving power.

    For some years now, molecular marker technology has made it possible to reveal random parts of the DNA. At the DNA level there appears to be large variation between varieties. Even closely related varieties can show considerable differences. The question has therefore arisen what marker technology can do to reveal differences between varieties. The claim is that variation at the DNA level would have much greater discriminating power because, first, much more variation is present and, second, it is not subject to environmental influences.

    However, the use of marker technology also has all kinds of statistical pitfalls. The most important concerns the non-random sampling of the DNA as it is present in the chromosomes. The talk centres on the problem of establishing differences between varieties on the basis of a large number of indicator variables (often more variables than units). The aim is to find minimum sets of discriminating markers while simultaneously minimizing the sample sizes. Various procedures are compared: variable selection procedures based on (logistic) regression and stepwise discriminant analysis, PLS, correspondence analysis and smoothing techniques.

    Fred van Eeuwijk studied Biology at the Universiteit van Utrecht. In 1985 he started working as a statistician at what is now called the DLO-Centrum voor Plantenveredelings- en Reproductie-Onderzoek (Wageningen). In 1996 he obtained his doctorate with a thesis on the statistical analysis of genotype-environment interaction. Since August 1997 he has been attached to the Statistics chair group of the Department of Agricultural, Environmental and Systems Technology, Landbouwuniversiteit Wageningen. His main interest is the analysis of interaction structures, with applications mainly in plant breeding.


    Regression on countless variables: a new approach to multivariate calibration
    Paul Eilers (DCMR Milieudienst Rijnmond) and Brian Marx (Louisiana State University)

    The word "countless" in the title is not meant literally, but the situations considered come close. A continuous "signal" (time series, spectrum, spatial profile) has been sampled at many points, and the results are to be used to predict a dependent variable.

    The best-known form of this problem is known in chemometrics as multivariate calibration. There are expensive but reliable measurements alongside cheap spectrometric results that give the reflectance at a large number of wavelengths. The goal is to predict (future) measurements of the first kind from the second.

    Ordinary regression fails miserably here, because the spectra contain hundreds of wavelengths while there are only a few dozen calibration samples. People have experimented with variable selection to find the most informative wavelengths. It is more common, however, to project the data onto an orthogonal vector space of low dimension. This has led to PCR (principal components regression), whose basis vectors are the eigenvectors of the explanatory variables alone, and PLS (partial least squares), in which the dependent variable is also involved in constructing the basis vectors. PLS has been discussed at earlier VOC meetings.

    This presentation is about a new approach: the vector of regression coefficients is forced to vary smoothly by means of B-splines and a "penalty". This makes the problem solvable, and numerical complications disappear. As an acronym we have come up with PSR: penalized signal regression. Because the model is based on linear regression with a penalty, the idea is easily extended to more complex problems, with counts or proportions as dependent variables. A few examples will illustrate this.
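
    The idea of forcing a smooth coefficient vector can be sketched in a simplified form. The code below is not the PSR method itself: it omits the B-spline basis and instead applies a second-order difference penalty directly to the regression coefficients, which is an assumption made to keep the example short. It minimizes |y - Xb|^2 + lam |Db|^2, where D takes second differences of b, and shows that the problem is solvable even with more signal points than samples. All data are simulated and all names are illustrative.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def penalized_fit(X, y, lam):
    """Minimize |y - X b|^2 + lam * |D b|^2, D = second-difference operator."""
    n, p = len(X), len(X[0])
    # Normal equations: (X'X + lam D'D) b = X'y
    A = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
         for i in range(p)]
    rhs = [sum(X[k][i] * y[k] for k in range(n)) for i in range(p)]
    for r in range(p - 2):
        d = {r: 1.0, r + 1: -2.0, r + 2: 1.0}  # one row of D
        for i, di in d.items():
            for j, dj in d.items():
                A[i][j] += lam * di * dj
    return solve(A, rhs)


def roughness(b):
    """Sum of squared second differences of the coefficient vector."""
    return sum((b[i] - 2 * b[i + 1] + b[i + 2]) ** 2 for i in range(len(b) - 2))


# Simulated calibration set: 10 samples, 30 signal points, smooth true coefficients
rng = random.Random(0)
n_samples, n_points = 10, 30
b_true = [math.exp(-((j - 15) / 5.0) ** 2) for j in range(n_points)]
X = [[rng.random() for _ in range(n_points)] for _ in range(n_samples)]
y = [sum(x * b for x, b in zip(row, b_true)) + rng.gauss(0.0, 0.1) for row in X]

b_light = penalized_fit(X, y, lam=0.01)
b_heavy = penalized_fit(X, y, lam=1000.0)
```

    A larger penalty weight gives a smoother coefficient vector, as measured by `roughness`; in practice the weight would be chosen by cross-validation or a similar criterion.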

    Paul Eilers was trained in electronics (zwakstroomtechniek). At his first employer, TNO-IWECO (1971-1974), he studied mechanical vibrations and became fascinated by statistical signal analysis. Later (1975-1982) he worked at the Environmental Research department of the Deltadienst of Rijkswaterstaat. It was the time of the Oosterschelde dam, and few organisms escaped the attention of the researchers. This produced interesting ordination and classification problems, in addition to the multivariate analysis of measurement series. Since 1982 he has been head of the Data Processing and Automation office of the DCMR Milieudienst Rijnmond in Schiedam. Measurements (air, noise and soil) and environmental complaints provide interesting and extensive material. Paul has broad statistical interests, but his hobby horse is the application of "smoothing", "penalties" and related techniques.


    Principal response curves for the analysis of time-dependent multivariate responses in ecotoxicology
    Cajo J.F. ter Braak (Centrum voor Biometrie, Wageningen) and Paul J. van den Brink (DLO-Staring Centrum, Wageningen)

    A novel multivariate method is proposed for the analysis of multivariate response data from a designed experiment that is repeatedly sampled in time. The long-term effects of the insecticide chlorpyrifos on an invertebrate community in outdoor experimental ditches are used as example data. The new method, which we have named the Principal Response Curve method, is based on a reduced rank regression that is adjusted for changes across time in the control treatment. This allows the method to focus on the time-dependent treatment effects. The principal component thereof is plotted against time. The method has been well received by the agro-chemical industry. Appealing as the results are so far, there are many interesting ways to improve the method further.

    Cajo J.F. ter Braak is a senior consultant at the Centre for Biometry Wageningen, the Netherlands, with specialization in multivariate methods for species-environment relationships. He is co-author of the textbook Data Analysis in Community and Landscape Ecology (Cambridge University Press, 1995), the inventor of canonical correspondence analysis and author of the program Canoco, which is the de-facto standard for ordination of ecological data. See http://www.cpro.dlo.nl/cbw/

    Paul van den Brink works in the environmental protection department of the Staring Centrum (SC-DLO). His research focuses on the risk assessment of pesticides in surface water. To this end, experiments are carried out using experimental ecosystems (so-called microcosms or mesocosms). Paul is responsible for the statistical evaluation of the results as well as for developing effect models.


    A clustering procedure for a qualitative X qualitative X quantitative three-way table based on a non-symmetric similarity measure: Identifying multivariately superior genotypes over environments
    Jean-Baptiste Denis (Laboratoire de Biométrie, INRA, Versailles) and Javier Moro-Serrano (CIFOR, INIA, Madrid)

    Each year, plant breeders select a subset of genotypes from a large initial collection of candidates. The aim is to choose genotypes adapted to different soil and climatic conditions on the basis of their performance on a number of criteria (characters, attributes) simultaneously. From a statistical viewpoint, decisions are based on a three-way table of genotypes (qualitative) x locations (qualitative) x attributes (quantitative). Usually the data stem from independent randomized block experiments carried out at different locations. For statistical analysis, genotypes and locations are factors and the criteria are variates whose correlations should be taken into account.

    A procedure is proposed to eliminate uniformly badly performing genotypes. This procedure partitions the set of genotypes in such a way as to minimize the number of qualitative interactions within the groups, i.e., interactions which cause rank-reversals between genotypes going from one environment to another. A hierarchical clustering method helps to identify the groups. The clustering is based on a so-called 'inferiority' score, an asymmetric similarity measure between pairs of genotypes. The score indicates whether a particular genotype is systematically superior to others or whether it is outcompeted by them. After computation of the scores a dendrogram is constructed. Special cutting rules for the dendrogram were developed based on the number of significant qualitative interactions. In the obtained groups, a leader genotype is supposed to perform better than any other genotype in that group. Plant breeders may restrict their attention to those leaders. The proposal is exemplified by an analysis of a real data set.
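
    The notion of a qualitative interaction, a rank reversal between two genotypes from one environment to another, can be illustrated with a short check. This is only a sketch for a single attribute with invented scores: it ignores statistical significance and the multivariate setting, and it is not the authors' 'inferiority' score, whose definition the abstract does not give. The function names are hypothetical.

```python
def has_rank_reversal(a, b):
    """True if genotype a beats b in some environment and loses in another."""
    wins = any(x > y for x, y in zip(a, b))
    losses = any(x < y for x, y in zip(a, b))
    return wins and losses


def count_crossovers(table):
    """Count genotype pairs with at least one rank reversal.
    `table` maps genotype name -> list of scores, one per environment."""
    names = sorted(table)
    return sum(has_rank_reversal(table[g], table[h])
               for i, g in enumerate(names) for h in names[i + 1:])


# Invented scores for three genotypes in three environments
scores = {
    "G1": [5.0, 6.0, 7.0],
    "G2": [4.0, 5.5, 6.0],
    "G3": [6.0, 5.0, 6.5],
}
n_crossovers = count_crossovers(scores)
```

    Here G1 outperforms G2 everywhere, so that pair shows no reversal and G2 could be dropped in favour of G1, whereas G3 crosses over with both of the others.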

    Jean-Baptiste Denis started working at the Institut National de la Recherche Agronomique (INRA, France) as a statistician researcher in 1971. In the period 1976-1977 he worked as visiting scientist at the Instituto Nacional de Investigaciones Agrarias (INIA, Spain). In 1983 he became Docteur-Ingénieur de l'Institut National Agronomique de Paris-Grignon. He was head of the Biometry Unit of INRA-Versailles from 1982 to 1988. Presently he is senior scientist at that same unit.

    His interests centre on the modelling of factorial interactions in the ANOVA framework, and the application of statistics to biology (mainly plant breeding). He collaborates with many European colleagues. He has published in many journals, both statistical (Biométrie-Praximétrie, Biometrics, Revue de la Statistique Appliquée, Utilitas Mathematica, Statistics, Journal of Applied Statistics, Applied Statistics) and applied (Agronomie, Biuletyn Oceny Odmian, Theoretical and Applied Genetics, Crop Science, Agronomy Journal, Canadian Journal of Plant Science, Current Genetics, Euphytica, Plant Breeding, Heredity, etc.).



    Webmaster: Michel van de Velden, Erasmus Universiteit, Rotterdam
    Last update: 05-10-2006