How we define a dataset
When we are ready to define a dataset, we create a dataset schema in our specification repository.
The definition sets out what type of dataset it is, what fields it should have and how we categorise it.
We create a markdown file for each dataset. The frontmatter should contain all the attributes of a dataset, except the text attribute. We use the content of the markdown file for the text attribute. This allows us to write a more detailed description of the dataset than we’d be able to in the frontmatter.
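As a rough illustration of the split between frontmatter and text, here is a minimal Python sketch. It assumes the frontmatter is delimited by `---` lines and holds simple `key: value` pairs; the function name `split_frontmatter` and the example values are hypothetical and are not taken from the specification repository.

```python
def split_frontmatter(markdown: str) -> tuple[dict[str, str], str]:
    """Split a markdown document into frontmatter attributes and body text.

    Assumes the frontmatter sits between two '---' lines and contains
    flat 'key: value' pairs (an assumption, not the repository's parser).
    """
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        # No frontmatter: treat the whole file as the text attribute.
        return {}, markdown.strip()

    # Find the line that closes the frontmatter.
    end = next(
        (i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"),
        None,
    )
    if end is None:
        raise ValueError("frontmatter is not closed with '---'")

    attributes: dict[str, str] = {}
    for line in lines[1:end]:
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, _, value = line.partition(":")
        attributes[key.strip()] = value.strip()

    # Everything after the closing delimiter becomes the text attribute.
    text = "\n".join(lines[end + 1:]).strip()
    return attributes, text


# Illustrative file content only; the values are made up for this sketch.
EXAMPLE = """---
dataset: ancient-woodland
name: Ancient woodland
plural: Ancient woodlands
typology: geography
---

Placeholder for the longer description of the dataset.
"""

attributes, text = split_frontmatter(EXAMPLE)
definition = {**attributes, "text": text}
```

Keeping the long description in the body means the frontmatter stays a flat list of short attributes.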
Attributes of a dataset definition
- Attribution
- Acknowledges where the data comes from and indicates whose set of usage rules should be followed.
- Collection
- Defines the collection. If empty, the platform will not collect and process data for this dataset.
- Dataset
- The reference of the dataset, used throughout the planning data ecosystem. It’s usually a kebab case version of the dataset name.
- Description
- A short description of what the dataset represents.
- End-date
- When the dataset is no longer used, an end-date can be added. This shows that the dataset used to exist but is no longer in use.
- Entry-date
- Add the date in the format YYYY-MM-DD when you added the dataset schema to specification.
- Entity-minimum
- Every record in the planning.data platform has an entity number. This attribute indicates the lowest number the platform should start issuing numbers from.
- Entity-maximum
- Every record in the planning.data platform has an entity number. This attribute indicates the highest end of the range the platform should use to issue numbers.
- Fields
- Defines the fields that make up each record in the dataset. Include all the fields that should be part of the record as it should be on the platform.
- Key-field
- Deprecated. It was previously used to explicitly specify a key-field for the dataset.
- Licence
- The terms and conditions under which users can use, share and distribute the dataset. The licences we use are defined in specification/content/licence.
- Name
- The name of the dataset.
- Paint-options
- JSON that defines the colour and fill of geometries on the national map. Candidate to remove from specification.
- Plural
- The pluralised version of the name. It’s used by planning.data when it needs to refer to the plural of the dataset.
- Phase
- The phase of development for this dataset and any associated specification, for example alpha. The values allowed are defined in the specification repository.
- Prefix
- Usage unknown. Leave blank.
- Realm
- A way to group datasets by areas of concern to the platform team. For example, datasets that are for the pipeline or datasets that are about the provenance mechanism. The realms are defined in the specification repository.
- Start-date
- An optional start-date in the format YYYY-MM-DD. Put a future date if you want a dataset to only be ‘active’ from that date.
- Text
- This attribute is used to provide an expanded description of the dataset. It’s the text part of the markdown file. It’s used as the summary on the dataset pages on planning.data.gov.uk (for example: https://www.planning.data.gov.uk/dataset/ancient-woodland).
- Themes
- High level DLUHC-related groupings for the dataset. They allow the department to know what sort of datasets are covered. The themes are defined in the specification repository.
- Typology
- The classification of the type of thing the dataset is modelling. For example, geography is used for things that contain a geospatial element.
- Version
- The version number of the dataset definition. It allows us to keep track of the field changes over time. You can read more about how we version our standards and datasets.
- Wikidata
- The Wikidata identifier (if it exists) for what the dataset is modelling. For example, Q3078732 for ancient woodland.
- Wikipedia
- The path part of the URL of the Wikipedia page about what the dataset covers. For example, Ancient_woodland for ancient woodland, because the Wikipedia page is https://en.wikipedia.org/wiki/Ancient_woodland.
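Several of the attributes above follow simple rules that can be checked mechanically. The sketch below is one possible set of checks, not the platform's own validation: the helper names (`to_kebab_case`, `is_valid_date`, `check_definition`) are hypothetical, the attribute keys are assumed to be lower-case versions of the names listed above, and the entity range values are assumed to be whole numbers stored as strings.

```python
import re
from datetime import datetime

# Strict YYYY-MM-DD shape: four digits, two digits, two digits.
DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def to_kebab_case(name: str) -> str:
    """Lower-case a dataset name and join its words with hyphens."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return "-".join(words)


def is_valid_date(value: str) -> bool:
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    if not DATE_PATTERN.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False  # right shape, but not a real date (for example 2024-02-30)
    return True


def check_definition(definition: dict[str, str]) -> list[str]:
    """Return a list of problems found in a dataset definition."""
    problems = []

    # Date attributes are optional, but must be well formed when present.
    for key in ("entry-date", "start-date", "end-date"):
        value = definition.get(key, "")
        if value and not is_valid_date(value):
            problems.append(f"{key} is not a valid YYYY-MM-DD date: {value!r}")

    # The entity range only makes sense if minimum <= maximum.
    minimum = definition.get("entity-minimum", "")
    maximum = definition.get("entity-maximum", "")
    if minimum.isdigit() and maximum.isdigit() and int(minimum) > int(maximum):
        problems.append("entity-minimum is greater than entity-maximum")

    return problems


# Illustrative values only; the dates and entity range are made up.
example = {
    "dataset": to_kebab_case("Ancient woodland"),
    "name": "Ancient woodland",
    "entry-date": "2024-01-15",
    "entity-minimum": "1000",
    "entity-maximum": "1999",
}
problems = check_definition(example)
```

The dataset reference is only usually the kebab case version of the name, so `to_kebab_case` is a starting suggestion rather than a rule to enforce.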
Common fields
Every dataset we define has a common set of fields.
These are:
- Entity
- Name
- Notes
- Prefix
- Reference
- End-date
- Entry-date
- Start-date
If the dataset needs any more fields, we try to reuse fields we have used elsewhere. As a last resort, we’ll create new fields.
When we create a new field, we first need to define it. All our fields are defined in markdown files in the fields directory in the specification repository.
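To make the idea of a common set of fields concrete, the following sketch compares a dataset's field list against the common fields listed above. The lower-case field identifiers, the helper names and the `geometry` field in the example are assumptions made for illustration.

```python
# Common fields from the list above, written as lower-case identifiers
# (the exact identifiers used in the repository are an assumption here).
COMMON_FIELDS = [
    "entity",
    "name",
    "notes",
    "prefix",
    "reference",
    "end-date",
    "entry-date",
    "start-date",
]


def missing_common_fields(fields: list[str]) -> list[str]:
    """Return the common fields that a dataset's field list leaves out."""
    present = set(fields)
    return [field for field in COMMON_FIELDS if field not in present]


def additional_fields(fields: list[str]) -> list[str]:
    """Return the fields a dataset uses beyond the common set."""
    return [field for field in fields if field not in COMMON_FIELDS]


# Hypothetical field list for a dataset with one extra field.
fields = COMMON_FIELDS + ["geometry"]
missing = missing_common_fields(fields)
extra = additional_fields(fields)
```

Anything returned by `additional_fields` is a candidate for reuse from existing field definitions before a new field is created.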