Interactive PCA Explorer

Before uploading your data, check that it is clean. In particular, ensure that the numeric variables contain only numbers or NA (to indicate missing data).

Rows that contain one or more NAs will be excluded from the PCA.

Columns that contain a mixture of numbers and text will not be included in the computation of the PCA results.
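The two rules above can be illustrated with a small sketch. This is not the app's own code (the app is written in R); it is a Python illustration using only the standard library. The helper names `is_number` and `clean_for_pca` are hypothetical, and the sketch assumes that a row is dropped when any of its fields is NA and that missing values are spelled exactly `NA`.

```python
import csv
import io
import math

def is_number(text):
    """Return True if text parses as a finite number, e.g. '5.1', '-3' or '1e4'."""
    try:
        return math.isfinite(float(text))
    except ValueError:
        return False

def clean_for_pca(csv_text, na="NA"):
    """Return (numeric column names, complete rows as lists of floats)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return [], []
    # A column is numeric only if every non-missing entry parses as a number,
    # so a column that mixes numbers and text is left out.
    numeric_cols = [
        col for col in rows[0]
        if all(row[col] == na or is_number(row[col]) for row in rows)
    ]
    # Assumption: a row is dropped if ANY of its fields is NA.
    complete = [row for row in rows if na not in row.values()]
    data = [[float(row[col]) for col in numeric_cols] for row in complete]
    return numeric_cols, data

# Made-up example: 'code' mixes numbers and text, and row 'b' contains an NA.
example = """name,length,width,code
a,5.1,3.5,7
b,NA,3.0,x9
c,4.7,3.2,8
"""
numeric_cols, data = clean_for_pca(example)
print(numeric_cols)
print(data)
```

In this made-up example only `length` and `width` survive as numeric columns, and row `b` is excluded because of its NA.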

Have a look at the iris.csv file included with this app to see what a clean CSV file looks like.


Select the options that match your CSV file, then upload your file:


After uploading your CSV file, click on the 'Inspect the data' tab.

The tableplot below (it will take a few seconds to appear) may be useful to explore the relationships between the variables, to discover strange data patterns, and to check the occurrence and selectivity of missing values.


Here is a summary of the data


Here is the raw data from the CSV file


This plot may take a few moments to appear when analysing large datasets. You may want to exclude highly correlated variables from the PCA.
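As a rough illustration of how highly correlated variables might be spotted, the sketch below computes Pearson correlations for every pair of columns and flags pairs above a cutoff. It is a standard-library Python sketch, not the app's code. The function names and the cutoff of 0.9 are assumptions made for the example; the app does not prescribe a threshold. It also assumes no column has zero variance.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Assumes neither column is constant (otherwise this divides by zero).
    return sxy / math.sqrt(sxx * syy)

def highly_correlated(columns, threshold=0.9):
    """Return (name_a, name_b, r) for pairs with |r| above the threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:  # threshold is an illustrative choice
            flagged.append((a, b, r))
    return flagged

# Made-up data: y is exactly 2 * x, z is only weakly related to either.
columns = {
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 1, 4, 2, 3],
}
flagged = highly_correlated(columns)
print(flagged)
```

Here only the pair `x` and `y` is flagged, so one of the two could be left out of the PCA.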


Summary of correlations

Among SPSS users, these tests are considered to provide some guidelines on the suitability of the data for a principal components analysis. However, they may be safely ignored in favour of common sense. Variables with zero variance are excluded.


Here is the output of Bartlett's test of sphericity, which tests whether the data come from a multivariate normal distribution with zero covariances. If p > 0.05, then PCA may not be very informative.
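For readers curious about what the test computes, the sketch below evaluates the usual Bartlett statistic, chi-square = -((n - 1) - (2p + 5) / 6) * ln(det(R)) with p(p - 1) / 2 degrees of freedom, where R is the correlation matrix, n the number of observations and p the number of variables. It is a standard-library Python illustration, not the app's code, and the function names are hypothetical. It stops at the statistic and degrees of freedom; the p-value would come from a chi-square distribution, which is not computed here.

```python
import math

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(map(float, row)) for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # swapping two rows flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

def bartlett_sphericity(corr, n_obs):
    """Return (chi-square statistic, degrees of freedom)."""
    p = len(corr)
    det = determinant(corr)
    if det <= 0:
        raise ValueError("correlation matrix must be positive definite")
    chi_square = -((n_obs - 1) - (2 * p + 5) / 6) * math.log(det)
    df = p * (p - 1) // 2
    return chi_square, df

# Made-up example: two variables with correlation 0.5 and 50 observations.
corr = [[1.0, 0.5],
        [0.5, 1.0]]
chi_square, df = bartlett_sphericity(corr, n_obs=50)
print(round(chi_square, 4), df)
```

A large statistic relative to its degrees of freedom gives a small p-value, which suggests the variables are correlated enough for a PCA to be informative.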


          

Here is the output of the Kaiser-Meyer-Olkin (KMO) index test. The overall measure varies between 0 and 1, and values closer to 1 are better. A value of 0.6 is a suggested minimum.
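The overall KMO index is commonly defined as the sum of squared off-diagonal correlations divided by that sum plus the sum of squared off-diagonal partial correlations, where the partial correlations are obtained from the inverse of the correlation matrix. The sketch below follows that definition in standard-library Python. It is an illustration under that assumption, not the app's code, and the function names are hypothetical.

```python
import math

def invert(matrix):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [
        list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def kmo_overall(corr):
    """Overall Kaiser-Meyer-Olkin index for a correlation matrix."""
    p = len(corr)
    inv = invert(corr)
    sum_r2 = 0.0  # squared off-diagonal correlations
    sum_q2 = 0.0  # squared off-diagonal partial correlations
    for i in range(p):
        for j in range(p):
            if i != j:
                partial = -inv[i][j] / math.sqrt(inv[i][i] * inv[j][j])
                sum_r2 += corr[i][j] ** 2
                sum_q2 += partial ** 2
    return sum_r2 / (sum_r2 + sum_q2)

# Made-up example: three variables that all correlate at 0.5.
corr = [[1.0, 0.5, 0.5],
        [0.5, 1.0, 0.5],
        [0.5, 0.5, 1.0]]
kmo = kmo_overall(corr)
print(round(kmo, 4))
```

In this example the index is 9/13, or about 0.69, which is just above the suggested minimum of 0.6.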


        

Choose the columns of your data to include in the PCA.

Only columns containing numeric data are shown here because PCA doesn't work with non-numeric data.

The PCA is automatically re-computed each time you change your selection.

Observations (i.e. rows) are automatically removed if they contain any missing values.

Variables with zero variance have been automatically removed because they're not useful in a PCA.
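A variable has zero variance exactly when all of its values are identical, so it contributes nothing to separating the observations. A minimal Python sketch of that filter follows; the function name is hypothetical and this is not the app's own code.

```python
def drop_zero_variance(columns):
    """Keep only the columns that contain at least two distinct values."""
    return {
        name: values
        for name, values in columns.items()
        if len(set(values)) > 1  # a constant column has zero variance
    }

# Made-up example: 'batch' is constant and is therefore dropped.
columns = {
    "length": [5.1, 4.9, 4.7],
    "batch": [1.0, 1.0, 1.0],
    "width": [3.5, 3.0, 3.2],
}
kept = drop_zero_variance(columns)
print(list(kept))
```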


Select options for the PCA computation (we are using the prcomp function here)

Scree plot

The scree plot shows the variance of each PC and the cumulative variance explained by the PCs (in %).
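The percentages behind a scree plot are simple to derive once the variance of each PC is known: each variance is divided by the total, and the results are accumulated. The Python sketch below shows that arithmetic on made-up variances; the function name is hypothetical and the numbers are not output from the app.

```python
from itertools import accumulate

def scree_values(pc_variances):
    """Return (percent of variance per PC, cumulative percent)."""
    total = sum(pc_variances)
    percent = [100.0 * v / total for v in pc_variances]
    cumulative = list(accumulate(percent))
    return percent, cumulative

# Made-up variances for four PCs, largest first.
percent, cumulative = scree_values([4.0, 3.0, 2.0, 1.0])
print(percent)
print(cumulative)
```

With these made-up values the first two PCs together account for 70% of the variance.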


PC plot: zoom and select points

Select the grouping variable.

Only variables where the number of unique values is less than 10% of the total number of observations are shown here (because seeing groups with 1-2 observations is usually not very useful).
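The 10% rule can be sketched as follows. This is a Python illustration with a hypothetical function name, not the app's code. It compares counts with integer arithmetic so that the boundary case (exactly 10%) is excluded without rounding surprises.

```python
def grouping_candidates(columns):
    """Return names of columns whose unique values number < 10% of the rows."""
    candidates = []
    for name, values in columns.items():
        n_unique = len(set(values))
        # n_unique < 0.10 * n_rows, written without floating point
        if 10 * n_unique < len(values):
            candidates.append(name)
    return candidates

# Made-up example with 30 observations.
columns = {
    "species": ["a"] * 15 + ["b"] * 15,    # 2 unique values: offered
    "site": ["x", "y", "z"] * 10,          # 3 unique values: exactly 10%, not offered
    "id": list(range(30)),                 # 30 unique values: not offered
}
candidates = grouping_candidates(columns)
print(candidates)
```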


Select the PCs to plot


Click and drag on the first plot below to zoom into a region of the plot. Alternatively, go directly to the second plot below and select points there to get more information about them.

Then select points on the zoomed plot below to get more information about them.

You can click on the 'Compute PCA' tab at any time to change the variables included in the PCA, and then come back to this tab and the plots will automatically update.


Click and drag on the plot below to select points, then inspect the table of selected points below.


Details of the brushed points


        

The code for this Shiny app is online at https://github.com/benmarwick/Interactive_PCA_Explorer. Please post any feedback, questions, etc. as an issue on GitHub.

The text is licensed CC-BY and the code MIT.