Thomas Friesen
2019-01-24
Public interest in election forecasting is growing. Since Nate Silver of 538 successfully forecast the 2008 US presidential election, election forecasting has become much more reliant on statistics and data mining. Since the 2016 presidential election, criticism of these methods has been louder than ever.
This post looks at the German federal election. For the German federal election, the Wahlbezirke (electoral districts), rather than the federal states, are the more important regions. A number of economic variables are available for those Wahlbezirke, and this post examines that economic data. The dataset can be downloaded here. It contains 49 variables covering population, different unemployment rates, number of employees and other fields.
The data is also visualized. Plotting all variables at once is not really possible, so a PCA (principal component analysis) is applied to the dataset and the resulting components are plotted. PCA is a dimension-reduction method that replaces a large set of variables with a smaller number of components, each of which is a linear combination of the original variables. Instead of trying to plot 49 variables, one can plot the components created by the PCA. If the variables are highly correlated, the dataset can be visualized using only the first few components while still retaining most of the information in the whole dataset.
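As a rough illustration of that last point, the sketch below runs a PCA on simulated, highly correlated toy data. It is not the election dataset; the variable names and the single hidden factor are assumptions made up for the example. It also checks that each component score really is a linear combination of the scaled variables.

```r
# Toy illustration with simulated data (NOT the election dataset).
set.seed(1)
n <- 200
latent <- rnorm(n)  # one hidden factor drives all four variables

toy <- data.frame(
  x1 =  latent + rnorm(n, sd = 0.3),
  x2 =  latent + rnorm(n, sd = 0.3),
  x3 = -latent + rnorm(n, sd = 0.3),
  x4 =  latent + rnorm(n, sd = 0.3)
)

toy_scaled <- scale(toy)
toy_pca <- princomp(toy_scaled)

# Share of the total variance captured by each component
var_share <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
print(round(var_share, 3))

# A score is a linear combination of the scaled variables,
# with the loadings as weights
manual_scores <- toy_scaled %*% unclass(toy_pca$loadings)
print(all.equal(unname(manual_scores[, 1]), unname(toy_pca$scores[, 1])))
```

Because the four toy variables are strongly correlated, the first component alone should carry the bulk of the variance, which is the situation in which a few components summarize a dataset well.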
library(spatial)
library(sf)
library(rgdal)
library(dplyr)
library(mapview)
library(reshape2)
library(ggplot2)
library(kableExtra)
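# daten_bezirk is assumed to be loaded already (one row per district)
# Drop the first six columns (presumably identifiers) and standardize the rest
data_pca=daten_bezirk[,-(1:6)]
data_pca=scale(data_pca)
pca1=princomp(data_pca)
# Store the scores of the first five components for mapping
daten_bezirk$PCA1=pca1$scores[,1]
daten_bezirk$PCA2=pca1$scores[,2]
daten_bezirk$PCA3=pca1$scores[,3]
daten_bezirk$PCA4=pca1$scores[,4]
daten_bezirk$PCA5=pca1$scores[,5]
# Loadings of the first eight components; rows are the input variables
pca_plot_data=pca1$loadings[,1:8]
# Note: WKR_NAME holds the variable names here, not district names
pca_plot_data=data.frame(WKR_NAME=rownames(pca_plot_data),pca_plot_data)
# Reshape to long form: one row per (variable, component) pair
pca_plot_data=melt(pca_plot_data,id.vars="WKR_NAME")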
We scale the data and use princomp to apply a PCA to it. We then extract the loadings, which give the weight of each variable in a principal component. A principal component with very high loadings for only a few variables is easy to interpret. Finally, we use melt to reshape the loadings into long form, which is convenient for plotting.
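The same steps can be sketched on R's built-in USArrests data, used here only as a stand-in for the election data. The long-form table is built with base R rather than melt, and the names demo_pca, load_mat and load_long are made up for the illustration. Sorting by absolute loading shows which variables dominate each component.

```r
# Illustration on the built-in USArrests data (NOT the election dataset).
demo_pca <- princomp(scale(USArrests))

# Loadings as a plain matrix: rows = variables, columns = components
load_mat <- unclass(demo_pca$loadings)[, 1:2]

# Long form: one row per (variable, component) pair,
# similar in shape to what melt() produces
load_long <- data.frame(
  variable  = rep(rownames(load_mat), times = ncol(load_mat)),
  component = rep(colnames(load_mat), each = nrow(load_mat)),
  loading   = as.vector(load_mat)
)

# Rank the variables within each component by absolute loading
load_long <- load_long[order(load_long$component, -abs(load_long$loading)), ]
print(load_long)
```

The variables at the top of each block are the ones that matter most for that component, and the sign of the loading gives the direction of the relationship.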
Measure | First component | Second component | Third component | Fourth component | Fifth component |
---|---|---|---|---|---|
Proportion of variance | 0.283 | 0.267 | 0.078 | 0.05 | 0.04 |
Cumulative proportion | 0.283 | 0.551 | 0.630 | 0.69 | 0.73 |
The first row shows the share of the total variance that each component explains. The first component explains 28.3%, while the second component explains about 26.7% of the total variance. The second row shows the cumulative share of variance explained by the components up to that point. The first four components, for example, explain about two thirds (69%) of the total variance. The first two components together explain more than half of the total variation and are therefore the most important ones.
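Both rows can be derived from the standard deviations that princomp returns. The sketch below is a minimal example on the built-in USArrests data, not the election dataset, and the object names are chosen only for illustration. The squared standard deviations are the component variances, and the cumulative row is simply their running sum.

```r
# Illustration on the built-in USArrests data (NOT the election dataset).
demo_pca <- princomp(scale(USArrests))

# Variance of each component = squared standard deviation
comp_var <- demo_pca$sdev^2

# Share of the total variance per component, and its running sum
prop_var <- comp_var / sum(comp_var)
cum_var  <- cumsum(prop_var)

var_table <- rbind(
  "Proportion of variance" = prop_var,
  "Cumulative proportion"  = cum_var
)
print(round(var_table, 3))
```

Calling summary() on a princomp object prints the same two quantities, so the manual calculation is mainly useful when the values are needed in a custom table.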
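# One panel per component; bar length = absolute loading, fill colour = sign
ggplot(pca_plot_data,aes(WKR_NAME,abs(value),fill=value))+facet_wrap(~variable,nrow=1)+
geom_bar(stat="identity")+
coord_flip()+
scale_fill_gradient2(name="Loadings",high="blue",mid="white",low="red")+
# With coord_flip(), the x aesthetic (the variables) is drawn on the vertical axis
xlab("Variables")+
ylab("Absolute values of the loadings")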
The loadings in a PCA show how much a variable contributes to the corresponding component. A larger absolute loading indicates greater importance: if a component has high loadings for certain variables, these variables account for most of the corresponding variation. The graph shows the absolute loadings for the first 8 components, with the fill colour indicating the sign. The first principal component has high loadings for the unemployment variables and for the number of people with no religion. It has negative loadings for the number of foreigners who receive welfare and the number of young people. The first component can therefore be read as an economic indicator. The second principal component has high loadings for the number of car owners and the number of house owners, and strongly negative loadings for the number of foreigners, the population density and the number of people between 25 and 34. The remaining components are somewhat harder to interpret, but they also explain much less of the total variation. Instead of trying to explain each component by looking at its loadings, the components are mapped below.
mapview(st_sf(daten_bezirk),zcol=c("PCA1","PCA2","PCA3","PCA4","PCA5"))