PUBLICATION

SynPop-DE: Synthetic population of 40 million German households using generative neural networks

FutureLab Social Metabolism & Impacts, PIK Potsdam & Humboldt-Universität zu Berlin

Jakob Napiontek, Peter-Paul Pichler
Preprint (SocArXiv) DOI: 10.31235/osf.io/zha8v_v1

Abstract

Household microdata combining socio-demographic, housing, income and expenditure attributes are a core resource for many studies in quantitative social science, such as modelling the household-level impacts of the energy transition. Yet no such data are openly available for Germany's full population. SynPop-DE provides a synthetic population of 40,235,916 households and their 81,629,116 members in all 400 German districts, calibrated to the 2022 census, with 34 attributes per household. Synthetic households are generated by estimating the joint attribute distribution of the German Household Budget Survey through a two-stage machine learning architecture. While an autoencoder first compresses high-dimensional categorical data into a continuous latent space, a generative adversarial network subsequently learns to sample new records from this representation. These records are then aligned with census marginals for all German districts using iterative proportional updating to ensure spatial representativeness. Validation along three dimensions confirms that the model learns attribute relationships and generates synthetic households that reproduce the statistical properties of the survey data (fidelity), supports downstream analyses with accuracy comparable to the original survey (utility), and prevents disclosure of individual respondents (privacy). The dataset is openly available at https://synpop.de.
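The fidelity dimension mentioned in the abstract can be illustrated with a small check. The sketch below is not the authors' validation code: it assumes total variation distance as the metric, and the attribute names (`tenure`, `size`) and toy records are hypothetical. It compares the distribution of one attribute, or the joint distribution of several, between survey and synthetic records.

```python
from collections import Counter


def distribution(records, attributes):
    """Relative frequency of each value combination of `attributes`."""
    counts = Counter(tuple(r[a] for a in attributes) for r in records)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions (0 = identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical toy records; real data would have 34 attributes per household.
survey = [
    {"tenure": "owner", "size": 2},
    {"tenure": "owner", "size": 4},
    {"tenure": "renter", "size": 1},
    {"tenure": "renter", "size": 1},
]
synthetic = [
    {"tenure": "owner", "size": 2},
    {"tenure": "renter", "size": 1},
    {"tenure": "renter", "size": 1},
    {"tenure": "renter", "size": 2},
]

# Marginal fidelity for a single attribute.
tvd_tenure = total_variation_distance(
    distribution(survey, ["tenure"]), distribution(synthetic, ["tenure"])
)
# Joint fidelity for a pair of attributes, which also reflects their relationship.
tvd_joint = total_variation_distance(
    distribution(survey, ["tenure", "size"]),
    distribution(synthetic, ["tenure", "size"]),
)
```

Comparing joint rather than only marginal distributions matters here because the abstract's claim concerns attribute relationships, which marginals alone cannot reveal.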

Key figures

82M synthetic persons
40M households
400 districts
34 attributes per household

Data sources

Einkommens- und Verbrauchsstichprobe (EVS, the German Household Budget Survey)

Training data for the generative model. Contains household composition, consumption expenditure, income, and building types.

Census 2022 (Zensus 2022)

Marginal distributions for all 400 districts, which form the basis for calibrating the synthetic households to the real population structure via iterative proportional updating (IPU).
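The IPU calibration can be illustrated with a minimal sketch of its core weight-update loop. This is an assumption-laden toy, not the authors' implementation: the function name `ipu_weights`, the three household types, and the target totals are hypothetical. Each row of `incidence` says how much one household contributes to each constraint (household-type counts and person counts), and the loop rescales the weights of all contributing households until the weighted totals match the targets.

```python
def ipu_weights(incidence, targets, max_iter=500, tol=1e-9):
    """Fit household weights so weighted constraint totals match `targets`.

    incidence[i][j]: contribution of household i to constraint j
    targets[j]:      positive target total for constraint j
    """
    weights = [1.0] * len(incidence)
    for _ in range(max_iter):
        for j, target in enumerate(targets):
            current = sum(w * row[j] for w, row in zip(weights, incidence))
            if current == 0:
                continue  # no household contributes to this constraint
            factor = target / current
            # Rescale only households that contribute to constraint j.
            weights = [
                w * factor if row[j] > 0 else w
                for w, row in zip(weights, incidence)
            ]
        # Stop once the worst relative deviation from any target is small.
        worst = max(
            abs(sum(w * row[j] for w, row in zip(weights, incidence)) - t) / t
            for j, t in enumerate(targets)
        )
        if worst < tol:
            break
    return weights


# Hypothetical constraints: [single households, couple households, adults, children]
incidence = [
    [1, 0, 1, 0],  # single adult
    [0, 1, 2, 0],  # couple without children
    [0, 1, 2, 1],  # couple with one child
]
targets = [30, 30, 90, 20]  # hypothetical totals for one district

weights = ipu_weights(incidence, targets)
fitted = [
    sum(w * row[j] for w, row in zip(weights, incidence))
    for j in range(len(targets))
]
```

For this consistent toy problem the weights converge towards 30, 10 and 20, so household-level and person-level totals are matched at the same time. With real census marginals the constraints may not be exactly satisfiable, in which case the loop ends at `max_iter` with a best-effort fit.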