PUBLICATION

SynPop-DE: Synthetic population of 40 million German households using generative neural networks

FutureLab Social Metabolism & Impacts, PIK Potsdam & Humboldt University Berlin

Jakob Napiontek, Peter-Paul Pichler
Preprint (SocArXiv) DOI: 10.31235/osf.io/zha8v_v1

Abstract

Household microdata combining socio-demographic, housing, income and expenditure attributes are a core resource for many studies in quantitative social science, such as modelling the household-level impacts of the energy transition. Yet no such data are openly available for Germany's full population. SynPop-DE provides a synthetic population of 40,235,916 households and their 81,629,116 members in all 400 German districts, calibrated to the 2022 census, with 34 attributes per household. Synthetic households are generated by estimating the joint attribute distribution of the German Household Budget Survey with a two-stage machine learning architecture. First, an autoencoder compresses the high-dimensional categorical data into a continuous latent space. Second, a generative adversarial network (GAN) learns to sample new records from this representation. These records are then aligned with census marginals for all German districts using iterative proportional updating (IPU) to ensure spatial representativeness. Validation along three dimensions confirms that the model learns attribute relationships and that the synthetic households reproduce the statistical properties of the survey data (fidelity), support downstream analyses with accuracy comparable to the original survey (utility), and prevent disclosure of individual respondents (privacy). The dataset is openly available at https://synpop.de.

Key Figures

82M Synthetic persons
40M Households
400 Districts
34 Attributes per household

Data Sources

Household Budget Survey (EVS)

Training data for the generative model. Contains household composition, consumption expenditures, income, and building characteristics.

Census 2022

Marginal distributions for all 400 districts, used for calibration by iterative proportional updating (IPU), which aligns the synthetic households with the real population structure of each district.
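
To illustrate the calibration idea, the following is a minimal sketch of iterative proportional updating in plain Python. It is not the authors' implementation: the function name `ipu`, the six toy households, and the target totals are all made up for illustration. Each household gets a weight, and the weights are rescaled one constraint at a time until the weighted totals approach the targets for both household-level and person-level categories. Convergence is not guaranteed in general, so the sketch also returns the remaining maximum relative error.

```python
def ipu(incidence, targets, max_passes=1000, tol=1e-6):
    """Reweight households so weighted totals approach the target marginals.

    incidence[i][j] is how much household i contributes to constraint j:
    1 or 0 for a household-level category, a person count for a
    person-level category. All names and data here are hypothetical.
    """
    weights = [1.0] * len(incidence)

    def weighted_total(j):
        return sum(w * row[j] for w, row in zip(weights, incidence))

    def max_rel_error():
        return max(abs(weighted_total(j) - t) / t for j, t in enumerate(targets))

    for _ in range(max_passes):
        # One pass: fit each constraint in turn by rescaling the weights
        # of every household that contributes to it.
        for j, target in enumerate(targets):
            current = weighted_total(j)
            if current == 0:
                continue  # no household contributes to this constraint
            factor = target / current
            for i, row in enumerate(incidence):
                if row[j] > 0:
                    weights[i] *= factor
        if max_rel_error() < tol:
            break
    return weights, max_rel_error()


# Toy district (assumed numbers). Columns:
# household type A, household type B, persons of type 1, persons of type 2.
incidence = [
    [1, 0, 1, 1],
    [1, 0, 1, 0],
    [1, 0, 2, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 2],
    [0, 1, 1, 1],
]
targets = [35, 65, 85, 90]

weights, error = ipu(incidence, targets)
```

In a setting like the one described here, the rows would be generated synthetic households and the targets would be the census marginals of a single district, with the procedure repeated per district. How weights are then turned into an integer number of households is a separate step that this sketch does not cover.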