Multiorder hydrologic Position for Europe — a Set of Features for Machine Learning and Analysis in Hydrology – Scientific Data

oooussama

Underlying dataset

The generation of this dataset is based on two datasets, first the “EU-Hydro–River Network Database” version v01315 and “EU-Hydro–Coastline” version v01322 with the advantage that data dependencies are low. From these two datasets, the four data layers (1) river network, (2) surface water bodies, (3) river basins/study area and 4) coastline were derived (see Table 1). Due to this relatively low input data requirements, it is possible to transfer the presented methodology to other regions or datasets with only little effort.

Table 1 Overview of the required input data.

The “EU-Hydro–River Network Database”15 as well as the “EU-Hydro–Coastline”22 has been manually downloaded from the Copernicus – Land Monitoring Service website (see Fig. 4a). The river network data is split into two GeoPackage (.gpkg) files for each of the 35 major river basins in the EEA39 countries, one with the naming scheme “drainage_network_<river name>_public_beta_v009.gpkg” and the second with “euhydro__v011.gpkg”. The coastline data is stored in a single Shapefile (.shp) file (see Fig. 4b). All files have a total size of approximately 14 GB when unzipped.

Fig. 4
figure 4

Workflow of the data processing in different software.

The single .shp file containing the coastline has a size of 288 MB. For instructions on accessing this underlying data, see Usage Notes.

Processing

The generation of the presented dataset requires several computationally expensive processing steps. For this reason and to make the methods more reproducible and maintainable, all processing steps are executed and controlled by a processing pipeline in the R programming language using the targets package (Fig. 4c)23,24. This processing or targets pipeline can be seen as programming script tracking changes in the source code and the data with the major advantage among many that it can skip processing steps that are still up-to-date and re-executes those that need to be updated. Due to the large memory size requirements for this dataset as well as for computational speed reasons, a PostgreSQL database with the PostGIS extension is used for certain processing steps of vector data and a GRASS GIS database is used for all final raster-based calculations of the EU-MOHP20 metrics (Fig. 4d,e). The calculations in the databases are also tracked and executed by the processing pipeline. In the following, the relevant steps of the methods are described. For a fully comprehensive description of all details, we refer to the source code itself (see Code availability).

Data preparation

In the following, the most relevant processing steps are described. These steps are part of the previously described pipeline and are defined as so-called targets in the source code of the pipeline. To simplify the description, the processing steps are grouped here according to the previously mentioned data layers.

Study area

The preprocessing steps to define and generate the study area are described first because it is required for the processing of all other data layers. The study area also defines the spatial coverage of the final product. For the generation of the study area, the layer *_eudem2_basins_h1* in the previously mentioned GeoPackage file with the naming scheme containing the suffix “drainage_network” (see Table 1) is used. It contains polygon geometries representing sub-basins of the major the river basins. Firstly, all polygon geometries belonging to European oversea territories such as the French islands in the Caribbean are removed. Then, the remaining polygons are merged. Subsequently, out of these polygons of contiguous land masses the 10 largest polygons by area are chosen as study area.

River network

The river or hydrographic network is based on the linestring geometries from the layer River_Net_l in the previously mentioned GeoPackage file with the naming scheme containing the suffix “euhydro” (see Table 1). This data layer requires more processings steps than the other three data layers. Firstly, specific linestring geometries are removed from the river network. These linestrings comprise all geometries categorized as canal or ditch in the attribute column dfdd encoded with the values BH020 for canal and BH030 for ditch25. These are mainly removed for the following two reasons: Firstly, many of the canal and ditch geometries have missing stream order values, which is required for the following processing steps and secondly, it is assumed that canals are often hydraulically disconnected from the natural hydrological system because of their impermeable side walls and canal bed. Besides this, the overall importance of canals and ditches is low when comparing their number of geometries to the number of river geometries (difference of three orders of magnitude). Furthermore, all linestring geometries categorized as non-perennial rivers in the attribute column hyp encoded with the values 2 (intermittent), 3 (ephemeral) and 4 (dry) are removed25. After this filtering, more than 1.05 million geometries remain. Then, missing and invalid stream order values are imputed with the value 1 as first stream order. This ensures that related geometries are at least included in the first hydrologic order. Subsequently, the river network geometries are clipped to the study area.

The next essential processing step implements a method to obtain linestring geometries that represent the mainstems of the river networks as described in the Supplementary of Belitz et. al. (2019). A mainstem is defined here as the longest path from the head water to the next most distant river mouth (see geometries with the same levelpath_id in Fig. 5b). In Fig. 5b the concept mainstems is schematically shown. In this figure, a mainstem consists of linestring geometries with the same levelpath_id. Belitz et al.14 made use of the column “LevelPathID” in their underlying NHDPlusV2 river network dataset26,27. As a comparable column does not exist in the “EU-Hydro–River Network Database” dataset15, its generation is a required preprocessing step. This step is especially essential when applying these methods to river network data that does not provide suitable columns to generate the mainstems from. The generation of this required column levelpath_id for the river network dataset15 involves the following steps. Firstly, a river network is derived separately for each hydrologic order by keeping only geometries with a stream order equal or greater than the specific hydrologic order as described in Background & Summary (see also Fig. 2). The following steps are repeated for each hydrologic order. The river network is sorted by the column longpath in descending order. The column longpath indicates the length of the path from the start node of a linestring geometry to the end node of the most downstream geometry of the river network. Then, starting with the top geometry, all line geometries are determined that are connected with each other by means of the columns object_id and nextdownid.The column object_id provides an unique ID for every linestring geometry and nextdownid indicates the object_id of the next downstream geometry. The now identified linestrings constitute the longest mainstem and are removed from the original river network. This is now iteratively repeated for the second top linestring in the remaining river network and so on.

Fig. 5
figure 5

Schematic representation of the river network and its linestring geometries before generating the mainstems (a), after the identification of mainstems including the column levelpath_id (b) and after merging the linestring geometries by column levelpath_id and adding a feature_id column (c).

Subsequently, the column levelpath_id is added as a unique ID for all geometries belonging to the same mainstem (Fig. 5b). The geometries of the respective river network are then merged based on this column (see difference in linestring geometries between Fig. 5b,c). This results in a river network for each hydrologic order separately with a reduced number of geometries as multiple geometries are now summarised into mainstems.

The next step addresses the occurrence of flow splits in the river network. A flow split or divergence is defined here as junction of linestring geometries with more than one linestring geometry representing out-flowing streams (orange marks in Fig. 6). To transfer the methods from Belitz et al.14 for the calculation of EU-MOHP20, it is required to remove minor flow paths that originate from such divergences from the river network. A classification of linestring geometries into major and minor flow paths is not directly provided by any column in the underlying river network dataset. Belitz et al.14 used the column divergence for removing all minor flow paths. Here, this is achieved by removing all linestring geometries that intersect other linestrings with both the end and start node. The removal of these minor flow paths is not done for the first hydrologic order to include all linestrings in at least one order. The implementation of these steps pointed out errors in the river network dataset15. These errors are related to errors of values in the columns longpath and nextdownid. Based on visual inspection, they occur in the french river networks of Garonne, Loire and Seine and are corrected programmatically during processing.

Fig. 6
figure 6

Schematic representation of the river network and its linestring geometries including divergences before (a) and after (b) the removal of minor paths. The linestring geometry with the feature_ids 7 and 8 have been removed from the river network in B, because they intersect other linestring geometries with both, the start and end node.

Then, the river networks are sorted by the length of the linestring geometries in descending order and provided with an unique ID for each geometry in the column feature_id (see feature_id in Fig. 5c).

Surface water bodies

The surface water bodies are derived from the layer InlandWater in the GeoPackage file with the naming scheme containing the suffix “euhydro” (see Table 1). A filter is applied to retain only the geometries of surface water bodies that have an area greater than four times the area of the grid cell. Another filter is applied to remove all geometries that do not intersect with the river network geometries. Since the river networks of the 9 hydrologic orders differ from each other, this second filter is applied individually for each of the river networks. This results in a dataset of surface water bodies for each hydrologic order.

Coastline

The data layer coastline is derived from the Shape file related to the “EU-Hydro–Coastline” dataset22 (see Table 1). Like rivers, the ocean, defined by the coastline, is an area where water accumulates and therefore its spatial representation is necessary for the generation of this dataset14.

Firstly, the polygon geometries of the underlying Shape file are merged. Then, a buffer of 3000 m is added to the merged geometries. This is necessary to ensure that the outline of the study area intersects with the coastline polygon geometries for the next step. Without this buffer, discrepancies between the study area and the coastline can be noticed. These discrepancies would lead to undesired results after the next step. The value of 3000 m is derived from visual inspection. The resulting multipolygon geometries are intersected with the outline of the study area to obtain the coastline as linestring. Those parts of the study area that do not intersect with the polygon geometries are categorized as “administrative borders over land”. This intersection then ensures that the coastline exactly aligns with the study area outline. The resulting coastline is shown in Fig. 7. The coastline is then added to each river network of all hydrologic orders.

Fig. 7
figure 7

Map showing location and the spatial distribution of coastline and administrative borders over land resulting from the preprocessing.

EU-MOHP

After obtaining all four required data layers as described previously, the next and last processing step comprises multiple smaller steps with the final goal to calculate and export the EU-MOHP20 metrics. Because the processing is analogous for all hydrologic orders and all of the 10 polygon geometries of the study area, this step is described only once in general terms. As all processing steps described below require grid based computations, a GRASS GIS database is used (see Fig. 4e).

The four data layers study area, river network including the coastline and surface water bodies of the respective hydrologic order and the coastline are written into the GRASS GIS database. The projection of the GRASS GIS database is set to the ETRS89 Lambert Azimuthal Equal-Area projection coordinate reference system (EPSG: 3035). The spatial resolution of the raster cells is set to 30 m.

Thiessen catchments and distance to stream (DS)

As described in Background & Summary, the catchment boundaries are required to determine DD (see Eqs. (1, 2) or Fig. 2). Therefore, Thiessen divides are used. A Thiessen divide is the outline of a Thiessen catchment which in turn is the area containing all points in a river network to which a river is closer than any other river28. One major advantage is that Thiessen divides can be calculated purely based on the river network itself while avoiding issues such as closed lows in the resulting metrics14. This advantage outweighs the numerous minor problems associated with DEM-based catchments, especially when taking into account the uncertain correspondence of the subsurface catchment to the surface catchment. A detailed discussion on the preference of Thiessen divides over topographic divides is provided in Belitz et. al. (2019), section 2.2.014. In order to obtain Thiessen divides, the first step is to calculate the euclidean distance from each raster cell center to the nearest river network geometry. The resulting distances correspond to DS in Eqs. (1–3) or Fig. 2). This step also determines the feature ID of the nearest geometry for all raster cells. Then, the polygons representing Thiessen catchments are derived by merging all raster cells that are assigned to the same feature ID. Finally, the outlines of these polygons are used as Thiessen divides.

Distance to divide (DD)

To obtain the distance to divide (DD) for each raster cell, the distance from each raster cell center to the nearest Thiessen divide is calculated. But the determination of the nearest Thiessen divide cannot be achieved by a simple nearest neighbour search as it is used for the calculation of DS and the feature ID of the nearest river. Implementing the physical reality that in catchments the water accumulates and runs off in rivers requires an additional condition. This condition has to ensure that distances to the nearest divide are not calculated across rivers. In other words the nearest Thiessen divide for each raster cell must not lie on the other side of the river. In other words, when drawing a imaginary line between the nearest Thiessen divide and the grid cell center, this line must not cross a river geometry (see black line versus red line in Fig. 13). Without this condition the geometric center line of the Thiessen catchments would be considered as areas of accumulation and discharge. To meet this condition, the GRASS GIS command r.walk was used. Minor inaccuracies regarding this command for the described purpose are noted in Technical Validation. The calculated distances correspond to DD in Eqs. (1, 2) or Fig. 2.

Measures DSD, LP and SD

Based on the two calculated raster layers containing the distances DS and DD, the three EU-MOHP measures DSD, LP and SD are now calculated by the application of the equations Eqs. (1–3) and the GRASS GIS raster map calculator (“r.mapcalc”). In order to reduce the storage size, the raster values of the measure LP are multiplied by a factor of 10,000 and rounded to be able to store them as integer values with two decimal digits. The two measures DSD and LP are rounded to the nearest integer. Finally, the resulting raster layers for LP, DSD and SD are exported from the GRASS GIS database and stored on disk as GeoTIFF files with.tif file extension.

Data descriptor

To enhance the reproducibility of the data descriptor manuscript itself, it is generated as part of the processing pipeline. Also all tables and all data derived figures are created from within the pipeline. This ensures that all figures are up-to-date and reflect the most recent state of the methods. The descriptor is written in RMarkdown from which a LaTeX and a PDF file are generated using the knitr package29,30.

#Multiorder #hydrologic #Position #Europe #Set #Features #Machine #Learning #Analysis #Hydrology #Scientific #Data

sidaliii