Machine Learning Methods in Visualisation for Big Data

Tutorial co-located with EuroVis 2016, June 2016, Groningen, the Netherlands
Monday June 6, 2016, 14:00-18:00

Machine learning techniques can help meet big data challenges by simplifying and summarising very large data sets for visualisation, whereas visualisation leverages the human visual system to help find unanticipated patterns. In this tutorial, we cover machine learning methods relevant to the area of visualisation. In addition to exploring the applicability, strengths, and weaknesses of such approaches, we provide links to available software tools that can help solve machine learning problems.

Registration for this tutorial is handled through the EuroVis 2016 registration system. Mark the tutorial's checkbox during registration to indicate that you wish to attend. Participation is limited to 50 people, and places are allocated on a first-come, first-served basis. If you have any further queries, please contact the tutorial organizers at mlinvis16@gmail.com

Introduction and Motivation

Machine Learning (ML) approaches provide powerful tools for the classification and summarisation of large quantities of data. These automated or semi-automated approaches allow the systems of today to scale to large data sets. The methods are critical for the big data problem and can benefit the field of information visualisation through increased scalability. Methods from ML can be used to simplify data to render it accessible to visualisation systems. Parametric models are particularly useful for big data because they can be trained on a representative subset and afterwards applied to all the data. This makes them much more scalable and also allows the user to test generalisation, which matters because all analytics is fundamentally statistical.
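The subset-then-apply workflow for parametric models can be sketched as follows. This is a minimal illustration, not code from the tutorial materials: it assumes PCA as the parametric model, NumPy as the library, and synthetic data; the function names are hypothetical.

```python
import numpy as np

def fit_pca(X, n_components=2):
    """Fit a linear projection (PCA) and return its parameters."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def project(X, mean, components):
    """Apply an already fitted projection to any data with the same features."""
    return (X - mean) @ components.T

def reconstruction_error(X, mean, components):
    """Mean squared distance between points and their reconstructions."""
    Z = project(X, mean, components)
    X_hat = Z @ components + mean
    return float(np.mean(np.sum((X - X_hat) ** 2, axis=1)))

rng = np.random.default_rng(0)

# Synthetic "big" data set (an assumption for illustration):
# 100,000 points in 10-D lying near a 2-D plane, plus small noise.
latent = rng.normal(size=(100_000, 2))
mixing = rng.normal(size=(2, 10))
X_all = latent @ mixing + 0.1 * rng.normal(size=(100_000, 10))

# Train on a random subset only.
subset_idx = rng.choice(len(X_all), size=1_000, replace=False)
mean, components = fit_pca(X_all[subset_idx])

# Apply the fitted model to all the data.
Z_all = project(X_all, mean, components)

# Generalisation check: error on the training subset vs. on everything.
train_err = reconstruction_error(X_all[subset_idx], mean, components)
all_err = reconstruction_error(X_all, mean, components)
print(Z_all.shape, train_err, all_err)
```

If the two errors are close, the model fitted on the subset generalises to the full data set; a large gap would suggest that the subset was not representative.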

Information Visualisation (IV) provides interactive methods for the visual representation of data. The tools and techniques of our field leverage the human perceptual system and the ability of the user to explore and explain patterns in data. Our systems can provide a means to discover unanticipated patterns in data sets that can subsequently be investigated quantitatively. However, the visual system has its limitations. Human vision is intrinsically limited to two or three dimensions, and only a few combined features can be handled in a comprehensible way. For high-dimensional data, exhaustive human analysis of all data features and their combinations can become arduous or infeasible.

Machine learning can help IV by providing methods to summarise and reduce complex data to levels that can be understood by humans; such summarised representations can then be integrated into visualisation systems, complementing their existing capabilities. ML is particularly powerful in this context because ML algorithms are well adapted to extracting relevant information from high-dimensional data sets following mathematical objectives. The challenge of applying machine learning to information visualisation is that it is an unsupervised task: there is no target variable providing a correct answer for the model to aim at. This is because the goal is often to explore data beyond what is known through existing annotation or hypotheses. This challenge has required innovation from the ML community, particularly in devising effective optimisation criteria (the so-called 'cost function'). It also leads to challenges in comparing the results from different methods to determine which are most effective.
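As one concrete example of such a cost function, the sketch below computes the raw stress used in some forms of multidimensional scaling, which penalises differences between pairwise distances in the original space and in the low-dimensional layout. The choice of raw stress, the use of NumPy, the function names, and the synthetic data are all assumptions made for illustration; the tutorial itself covers several different criteria.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def raw_stress(X_high, Y_low):
    """Sum of squared differences between original and layout distances."""
    D = pairwise_distances(X_high)
    d = pairwise_distances(Y_low)
    upper = np.triu_indices(len(X_high), k=1)  # count each pair once
    return float(((D[upper] - d[upper]) ** 2).sum())

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))          # synthetic 5-D data

Y_first_two = X[:, :2]                # layout: keep the first two coordinates
Y_random = rng.normal(size=(50, 2))   # layout: unrelated random positions

stress_first_two = raw_stress(X, Y_first_two)
stress_random = raw_stress(X, Y_random)
stress_perfect = raw_stress(X, X)     # identical distances give zero cost
print(stress_first_two, stress_random, stress_perfect)
```

A lower value means the layout preserves the original distances more faithfully under this criterion. Because different methods optimise different criteria, a layout that scores well on one cost function need not score well on another, which is part of why comparing methods is difficult.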

In order to tackle big data problems our two communities need to leverage the advantages that the two fields can provide to each other. However, our fields are only beginning to work together. This tutorial is designed to cover relevant machine learning methodologies for visualisation and provides some practical resources that participants can use for the visualisation techniques and systems that they design. Our tutorial assumes a visualisation audience and covers the relevant tools and techniques for machine learning from this perspective. In addition to the scientific content, we present existing software solutions that researchers and practitioners can use in order to apply these techniques immediately.

Main Topics of the Tutorial

Introduction and motivation [Presenters: Ian Nabney, Daniel Archambault]
Dimensionality reduction [Presenters: Ian Nabney, Jaakko Peltonen]
Clustering [Presenters: Ian Nabney, Jaakko Peltonen]
Analysis of multivariate graphs and graph mining [Presenter: Daniel Archambault]
Data lab: bring your own data to be analyzed with help from the presenters.

Tutorial Schedule

14:00-14:20 Introduction and motivation (20 mins) [Presenters: Ian Nabney and Daniel Archambault]
This section will provide a brief introduction to the key elements of machine learning and relate them to information visualisation:
- introduction to data models, parametric, semi-parametric and non-parametric;
- an overview of cost functions, modelling, Bayes' theorem and Bayesian inference, and optimisation;
- framework for the application of machine learning to visualisation methods.
- Slides in PDF format.
14:20-15:40 Dimensionality reduction (80 mins)
- 14:20-14:50 Latent variable and generative models (30 mins) [Presenter: Ian Nabney]
  - Brief comparison of the two main modes of data projection (or dimensionality reduction), generative models and non-generative models.
  - Principal Component Analysis (PCA) is defined as a generative model, and we show how it can be generalised to a non-linear projection that acts as a density model for the data (a latent variable model, exemplified by Generative Topographic Mapping, GTM).
  - Other formulations of latent variable models that can be used for visualisation, such as the Gaussian Process Latent Variable Model (GPLVM).
  - Discussion of how the probabilistic nature of generative models can be exploited by extensions of the basic algorithm to deal with missing values, discrete and mixed data types, hierarchies, temporal data, and feature selection.
  - Illustrations from real applications are provided throughout. Demonstrations will use the DVMS visualisation toolkit (in Matlab).
  - Slides in PDF format
- 14:50-15:20 Non-generative models (30 mins) [Presenter: Jaakko Peltonen]
  - Multidimensional Scaling (MDS) and its variants, a widely used family of methods that aim to preserve pairwise distances between data items. The variants differ in what they treat as most important to preserve: large or small distances, direct or geodesic distances, or even distance ranks.
  - Methods that preserve similarities (neighbourhood relationships) instead of distances, such as the Stochastic Neighbour Embedding and Neighbour Retrieval Visualizer family. We show the methods are interpretable in an information retrieval framework.
  - Illustrations and demos are provided throughout.
  - Slides in PDF format
- 15:20-15:40 Software activity with participants (20 mins) [Presenters: Ian Nabney and Jaakko Peltonen]
  - Participants are provided with links to software tools that they can download before the tutorial. During this segment of the tutorial, either a demonstration of these tools will be provided to the participants or participants will have an opportunity to work with the provided tools on problems in dimensionality reduction. The activity will reinforce the concepts presented in the dimensionality reduction segment of the tutorial.
15:40-16:10 BREAK (30 mins)
16:10-16:35 Clustering (25 mins) [Presenters: Ian Nabney and Jaakko Peltonen]
- Generative models: mixture models and links to dimensionality reduction, Bayesian methods for generative models. Slides in PDF format.
- Hierarchical models. Slides in PDF format.
16:35-17:20 Analysis of multivariate graphs and graph mining (45 mins) [Presenter: Daniel Archambault]
- Community finding approaches
- Evaluating communities and NMI (Normalized Mutual Information)
- Multivariate graphs and graph mining
- Integrating ML and visualization
- Software activity about community finding and NMI
- Slides in PDF format.
17:20-17:45 Data lab: bring your own data to be analyzed with help from the presenters (25 mins) [Presenters: all]
- Participants are invited to bring their own data to the tutorial and experiment on that data with techniques presented in the tutorial. Publicly available data sets will also be provided.
17:45 Closing of the tutorial
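
The Normalized Mutual Information (NMI) measure listed in the graph mining session compares two community assignments of the same nodes. The sketch below is a minimal illustration in plain Python, not code from the tutorial materials. It assumes the arithmetic-mean normalisation, NMI = 2 I(A;B) / (H(A) + H(B)), which is one of several normalisations in use; the function names and example labels are hypothetical.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy (in nats) of a community assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information (in nats) between two assignments of the same nodes."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )

def nmi(a, b):
    """NMI with arithmetic-mean normalisation (an assumed convention)."""
    if len(a) != len(b):
        raise ValueError("both assignments must label the same nodes")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 and h_b == 0.0:
        return 1.0  # both assignments put every node in a single community
    return mutual_information(a, b) / ((h_a + h_b) / 2)

# Community label for each of six nodes (illustrative data).
truth = [0, 0, 0, 1, 1, 1]
relabelled = [1, 1, 1, 0, 0, 0]   # same communities, different label names
partial = [0, 0, 1, 1, 1, 1]      # one node placed in the wrong community

print(nmi(truth, relabelled))
print(nmi(truth, partial))
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # statistically independent assignments
```

NMI depends only on how nodes are grouped, not on the label names, so a detected community structure can be compared with a reference assignment without first matching labels.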

Course notes and materials

The slides and materials will be made available online before the tutorial.

Organizers

Ian Nabney is the Director of the System Analytics Research Institute and Head of both the Computer Science and Mathematics departments at Aston University. He received his BA in Mathematics from Oxford University and a PhD in Mathematics from Cambridge University. He has over 20 years’ experience in machine learning research and has published more than 80 papers (1900 citations). He is the system architect for the Netlab pattern analysis toolbox, which has been downloaded more than 40,000 times since 1999 (the accompanying book has been through three reprints), and for the Data Visualisation and Modelling System (DVMS), which integrates data projection and information visualisation techniques to provide a rich interactive environment for data exploration and visual analytics. DVMS will be used for the demonstrations of generative models. He has won grants worth more than 3M GBP from EPSRC, the EU, TSB, and industry, and has supervised 11 PhD students to completion. He is the Chair of the Natural Computing Applications Forum, a principal mechanism in the UK for the exchange of ideas between academics and industry on natural computing technology and practical applications.

Jaakko Peltonen is an associate professor of statistics (data analysis) at the School of Information Sciences, University of Tampere; he is also currently an Academy Research Fellow at Aalto University, where he is a PI of the Statistical Machine Learning and Bioinformatics research group. He received his D.Sc. from Helsinki University of Technology in 2004. He is an associate editor of Neural Processing Letters and an editorial board member of Heliyon. He has served on the organising committees of seven international conferences and one international summer school and on the program committees of 24 international conferences/workshops, and he referees for numerous international journals and conferences. He is an expert in statistical machine learning methods for exploratory data analysis, visualisation of data, and learning from multiple sources.

Daniel Archambault received his PhD in Computer Science from the University of British Columbia, Canada in 2008. He is currently a Lecturer in Computer Science at Swansea University in the United Kingdom. During his post-doctoral studies at University College Dublin, he applied his expertise in information visualisation to help visualise the results of machine learning approaches, particularly in the area of social media visualisation. This work inspired him to co-chair the AAAI ICWSM Workshop on Social Media Visualisation (SocMedVis 2012 and 2013). His other areas of expertise lie primarily in graph visualisation and drawing, as well as perceptual factors in information visualisation.

Acknowledgements

This tutorial was first discussed at Dagstuhl Seminar 15101, "Bridging Information Visualization with Machine Learning".