How we define a dataset

When we are ready to define a dataset, we create a dataset schema in our specification repository.

The definition sets out what type of dataset it is, what fields it should have and how we categorise it.

We create a markdown file for each dataset. The frontmatter should contain all the attributes of a dataset, except the text attribute. We use the content of the markdown file for the text attribute. This allows us to write a more detailed description of the dataset than we’d be able to in the frontmatter.
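As a rough illustration of that split between frontmatter and text, the sketch below reads a dataset definition into a single dictionary of attributes. It assumes the frontmatter is delimited by `---` lines and holds simple `key: value` pairs; the function name `split_frontmatter` and the example values are hypothetical, and real definition files may need a full YAML parser.

```python
def split_frontmatter(markdown: str) -> dict:
    """Return the frontmatter attributes plus a 'text' attribute from the body.

    Assumes '---' delimiters and flat 'key: value' lines (an assumption,
    not something the specification repository is confirmed to guarantee).
    """
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("expected the file to start with a '---' line")

    # Find the line that closes the frontmatter block.
    end = None
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            end = i
            break
    if end is None:
        raise ValueError("frontmatter block is not closed")

    attributes = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, _, value = line.partition(":")
        attributes[key.strip()] = value.strip()

    # Everything after the frontmatter becomes the text attribute.
    attributes["text"] = "\n".join(lines[end + 1:]).strip()
    return attributes


# Illustrative values only; not copied from a real definition file.
EXAMPLE = """---
dataset: ancient-woodland
name: Ancient woodland
plural: Ancient woodlands
typology: geography
---

A longer description of the dataset goes here.
"""

definition = split_frontmatter(EXAMPLE)
print(definition["dataset"])
print(definition["text"])
```

Keeping the long description in the body means it can span several paragraphs without any escaping in the frontmatter.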

Attributes of a dataset definition

Attribution
Acknowledges where the data comes from and indicates whose set of usage rules should be followed.
Collection
Defines the collection. If empty, the platform will not collect and process data for this dataset.
Dataset
The reference of the dataset, used throughout the planning data ecosystem. It’s usually a kebab-case version of the dataset name.
Description
A short description of what the dataset represents.
End-date
When a dataset is no longer used, add an end-date. This shows that the dataset used to exist but is no longer in use.
Entry-date
The date, in the format YYYY-MM-DD, when you added the dataset schema to the specification repository.
Entity-minimum
Every record in the planning.data platform has an entity number. This attribute indicates the lowest number the platform should start issuing numbers from.
Entity-maximum
Every record in the planning.data platform has an entity number. This attribute indicates the highest number in the range the platform should issue numbers from.
Fields
Defines the fields that make up each record in the dataset. Include all the fields that should be part of the record as it should be on the platform.
Key-field
Deprecated. It was previously used to explicitly specify a key field for the dataset.
Licence
The terms and conditions under which users can use, share and distribute the dataset. The licences we use are defined in specification/content/licence.
Name
The name of the dataset.
Paint-options
JSON that defines the colour and fill of geometries on the national map. It is a candidate for removal from the specification.
Plural
The pluralised version of the name. It’s used by planning.data when it needs to refer to the plural of the dataset.
Phase
The phase of development for this dataset and any associated specification, for example alpha. The values allowed are defined in the specification repository.
Prefix
Usage unknown. Leave blank.
Realm
A way to group datasets by areas of concern to the platform team. For example, datasets that are for the pipeline or datasets that are about the provenance mechanism. The realms are defined in the specification repository.
Start-date
An optional start-date in the format YYYY-MM-DD. Put a future date if you want a dataset to only be ‘active’ from that date.
Text
This attribute is used to provide an expanded description of the dataset. It’s the text part of the markdown file. It’s used as the summary on the dataset pages on planning.data.gov.uk (for example: https://www.planning.data.gov.uk/dataset/ancient-woodland).
Themes
High-level DLUHC-related groupings for the dataset. They let the department know what sorts of datasets are covered. The themes are defined in the specification repository.
Typology
The classification of the type of thing the dataset is modelling. For example, geography is used for things that contain a geospatial element.
Version
The version number of the dataset definition. It allows us to keep track of the field changes over time. You can read more about how we version our standards and datasets.
Wikidata
The Wikidata identifier (if it exists) for what the dataset is modelling. For example, Q3078732 for ancient woodland.
Wikipedia
The path part of the URL of the Wikipedia page about what the dataset covers. For example, Ancient_woodland for ancient woodland, because the Wikipedia page is https://en.wikipedia.org/wiki/Ancient_woodland.
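Several of the attributes above follow simple rules that can be checked mechanically. The sketch below is one hypothetical way to do that, not part of the platform's tooling: the function names, the lowercase attribute keys and the example values are assumptions. It also assumes that end-date uses the same YYYY-MM-DD format as entry-date and start-date.

```python
import re
from datetime import date

KEBAB_CASE = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_valid_date(value: str) -> bool:
    """True if value is a real calendar date written as YYYY-MM-DD."""
    if not ISO_DATE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def check_definition(attrs: dict) -> list[str]:
    """Return a list of problems found in a dataset definition's attributes."""
    problems = []

    # The reference is usually kebab case, so treat this as a warning.
    reference = attrs.get("dataset", "")
    if not KEBAB_CASE.fullmatch(reference):
        problems.append(f"warning: dataset '{reference}' is not kebab case")

    # Dates are optional, but when present they should be YYYY-MM-DD.
    for key in ("entry-date", "start-date", "end-date"):
        value = attrs.get(key, "")
        if value and not is_valid_date(value):
            problems.append(f"{key} '{value}' is not in the format YYYY-MM-DD")

    # The entity range only makes sense if minimum <= maximum.
    low = attrs.get("entity-minimum", "")
    high = attrs.get("entity-maximum", "")
    if low and high and int(low) > int(high):
        problems.append("entity-minimum is greater than entity-maximum")

    return problems


# Illustrative values only; the entity range here is made up.
example = {
    "dataset": "ancient-woodland",
    "entry-date": "2024-01-31",
    "entity-minimum": "100",
    "entity-maximum": "200",
}
print(check_definition(example))
```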

Common fields

Every dataset we define has a common set of fields.

These are:

  • Entity
  • Name
  • Notes
  • Prefix
  • Reference
  • End-date
  • Entry-date
  • Start-date

If a dataset needs more fields, we try to reuse fields we have used elsewhere. We only create new fields as a last resort.

When we create a new field, we first need to define it. All our fields are defined in markdown files in the fields directory of the specification repository.
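The order of preference described above (common fields first, then reused fields, then new fields) can be sketched as a small helper. This is an illustration under assumptions: the function `plan_fields` is hypothetical, field identifiers are assumed to be the lowercase forms of the names listed, and the example fields `geometry` and `ancient-woodland-status` are made up.

```python
# The common fields listed above, assumed here to use lowercase identifiers.
COMMON_FIELDS = [
    "entity",
    "name",
    "notes",
    "prefix",
    "reference",
    "end-date",
    "entry-date",
    "start-date",
]


def plan_fields(wanted: list[str], defined: set[str]):
    """Sort a dataset's wanted fields into three groups.

    wanted  -- the fields proposed for the dataset
    defined -- fields that already have a definition file (assumed input)

    Returns (missing_common, reused, new):
      missing_common -- common fields the proposal forgot to include
      reused         -- extra fields that are already defined elsewhere
      new            -- extra fields that need a new definition file
    """
    wanted_set = set(wanted)
    missing_common = [f for f in COMMON_FIELDS if f not in wanted_set]
    extra = [f for f in wanted if f not in COMMON_FIELDS]
    reused = [f for f in extra if f in defined]
    new = [f for f in extra if f not in defined]
    return missing_common, reused, new


# Hypothetical proposal: 'geometry' is already defined, the status field is not.
missing, reused, new = plan_fields(
    ["entity", "name", "reference", "geometry", "ancient-woodland-status"],
    {"entity", "name", "reference", "geometry"},
)
print("missing common fields:", missing)
print("reused fields:", reused)
print("fields to define:", new)
```

Anything that ends up in the last group would need its own markdown definition before the dataset schema can use it.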