Background & Summary

Under the influence of nutrient pollution, habitat alteration, and climate change, many large shallow lakes, such as Lake Okeechobee and Lake Winnebago (USA), Lake Winnipeg (Canada), Lake Chaohu and Lake Taihu (China), Lake Kasumigaura (Japan), and Lake Peipsi (Estonia), are suffering from severe eutrophication, cyanobacterial blooms, deoxygenation, and the loss of aquatic vegetation1,2,3. Lake Taihu is the third largest freshwater lake (2,338.1 km²) in China and is located in an economic development hotspot with >40 million inhabitants in the Yangtze River Delta region. With rapid economic development and increasing urbanization and population density since the 1980s, the external pollution load from the watershed to Lake Taihu has been increasing, and cyanobacterial blooms occur frequently4. For example, massive cyanobacterial blooms overwhelmed the drinking water plants near Lake Taihu in May 2007, leaving millions of residents without drinking water for nearly one week.

With great demand for lake ecological services, balancing lake ecological protection with socioeconomic development is a prominent challenge in the Lake Taihu Basin. Continuous efforts have been made to manage and protect the aquatic ecosystem of Lake Taihu, particularly after the drinking water crisis in 2007. From 2007 to 2017, more than 4,339 chemical enterprises that did not meet emission standards were closed, and more than 3,000 household-based rural domestic sewage treatment facilities were built1. As both external and internal pollution control measures in the Taihu Basin have been vigorously promoted, the water ecosystem of Lake Taihu has improved. However, it still has not reached a healthy state5. In recent years, cyanobacterial blooms have occurred frequently and nutrient levels have remained high6, which continues to limit regional sustainable development.

Research on the water quality and bio-optical parameters of Lake Taihu was first carried out via field measurements7,8,9,10. The Taihu station (TH station) was established in 1988 by the Taihu Laboratory for Lake Ecosystem Research (TLLER) of the Nanjing Institute of Geography and Limnology, Chinese Academy of Sciences (NIGLAS), and became a member of the Chinese Ecosystem Research Network (CERN) of the Chinese Academy of Sciences. The TH station joined the National Field Scientific Observation Station (NFSOS) of China in 2001 and the Global Lake Ecosystem Observatory Network (GLEON) in 2005. The TH station first carried out continuous observations of water quality at several sites in Meiliang Bay of Lake Taihu in 1998, and in 2004 it began quarterly monitoring of the whole lake11. Routine water sampling, which has accumulated long-term historical data, is mainly used to monitor water quality parameters. In addition, satellite remote sensing and automatic field observation stations have been used to monitor the water quality in Lake Taihu12,13. Since 2010, the number of studies monitoring the aquatic ecosystem of Lake Taihu with satellite data, such as data from the Moderate Resolution Imaging Spectroradiometer (MODIS), Operational Land Imager (OLI), and Ocean and Land Color Instrument (OLCI), has grown explosively14,15,16. Many studies on cyanobacterial blooms, aquatic vegetation, chlorophyll-a concentration (Chla), Secchi disk depth (SDD), and other bio-optical parameters have illustrated the spatiotemporal dynamics of the aquatic ecosystem in Lake Taihu17,18,19,20. In addition, the influence of environmental factors, including wind, temperature, nutrients, and radiation, on the water quality and trophic state of Lake Taihu has been analyzed, revealing the complex causes of cyanobacterial bloom occurrence4,6,20,21,22,23,24.
However, these studies have usually addressed a single parameter or a few related parameters independently, so a comprehensive published dataset for Lake Taihu is still lacking.

Based on the field data measured by the TH station and previous studies conducted by the authors, we systematically compiled long-term spatiotemporal data linked to cyanobacterial blooms in Lake Taihu, including water quality, bio-optical properties, and related climate and anthropogenic factors (THQBCA). The dataset includes field-measured data of 13 parameters for more than 15 years, providing a comprehensive picture of the water quality, phytoplankton, and zooplankton. Bio-optical parameters derived from satellite data are included to characterize their spatial and temporal variations in Lake Taihu. Additionally, meteorological and water level data from representative monitoring stations are included, together with data on population density, nighttime light, and land cover across the Lake Taihu Basin. With 26 parameters in terms of water quality, natural factors, and human activities, this is the most comprehensive integrated spatiotemporal dataset for Lake Taihu to date. Within each category, the temporal and spatial resolutions are generally consistent, except for a few variables with missing data. Across categories, the spatiotemporal resolution and time range differ; we have nevertheless retained the longest available record of each parameter, allowing users to perform their own analyses and data extraction.

Our THQBCA dataset is expected to facilitate the analysis of spatial and temporal variations in cyanobacterial blooms in Lake Taihu and of the mechanisms driving them. The dataset serves as a foundation for long-term algal bloom prediction and contributes to assessing the current status and existing issues of the Lake Taihu ecosystem and to understanding its ecological environmental conditions and development trends. However, the dataset cannot support short-term prediction and early warning of cyanobacterial blooms, which require higher-frequency (for example, daily or hourly) data. Note that, to provide a long-term dataset, the satellite-derived bio-optical data are provided as annual mean values, which were derived from daily images. Users requiring higher-frequency data can contact us or generate the data themselves following the methods outlined in the literature. Moreover, this dataset could support studies on the protection, management, and restoration of the ecological environment of eutrophic lakes.

Methods

Study area

Lake Taihu, the third largest freshwater lake in China, is a typical large shallow lake with an average water depth of 1.9 m. The Lake Taihu Basin is primarily influenced by the southeast monsoon, which results in distinct seasons, a long frost-free period, and abundant rainfall. Lake Taihu serves multiple functions, such as flood storage, irrigation, transportation, and tourism, and most importantly, it is a critical source of drinking water for residents in the surrounding cities/counties, including Shanghai, Wuxi, Suzhou, and Huzhou. Lake Taihu is characterized by the coexistence of a variety of ecological types (Fig. 1). The northern region of the lake (ZS and ML) has a typical algae-based ecosystem. The western and southwestern parts (WT and ST) have open water surfaces with turbid water and often experience algal blooms. The eastern region of the lake (XK and ET) features extensive aquatic vegetation, including submerged, floating, and emergent aquatic vegetation.

Fig. 1 (a) Location of Lake Taihu, (b) Lake Taihu Basin, and (c) Lake Taihu. The water quality parameters were collected from the sampling points, the water level was measured at the Taihu (TH) station, and the meteorological data was measured at the Dongshan (DS) station.

Field measurements

Water samples were collected and analyzed by researchers of the TH station at 32 sampling points in Lake Taihu in February, May, August, and November from 2005 to 2020. The water quality parameters included pH, dissolved oxygen (DO), permanganate index (CODMn), total phosphorus (TP), orthophosphate (PO4-P), total nitrogen (TN), ammonia nitrogen (NH3-N), nitrate nitrogen (NO3-N), and nitrite nitrogen (NO2-N). In addition, the phytoplankton and zooplankton communities were measured during each cruise from 2005 to 2020. The water samples (0.5 m below the water surface) were stored in shaded and refrigerated conditions and brought to the laboratory for analysis following the standard methods for lake ecological survey, observation, and analysis25. The above parameters of the sampling sites located in each segment of Lake Taihu were averaged to represent the conditions of that segment.
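The segment averaging described above can be sketched as follows. This is a minimal illustration, not the authors' processing code: the function name `segment_means`, the record layout, and the sample values are hypothetical, and skipping missing values before averaging is an assumption.

```python
from collections import defaultdict
from statistics import mean


def segment_means(records):
    """Average site-level values within each lake segment.

    `records` is an iterable of (segment, value) pairs; None marks a
    missing measurement and is skipped (an assumption of this sketch).
    """
    grouped = defaultdict(list)
    for segment, value in records:
        if value is not None:
            grouped[segment].append(value)
    return {segment: mean(values) for segment, values in grouped.items()}


# Made-up TP readings (mg/L) from one cruise; segment codes follow Fig. 1.
sample = [("ML", 0.12), ("ML", 0.10), ("ET", 0.04), ("ET", None), ("ET", 0.06)]
tp_by_segment = segment_means(sample)
```

Applying the same function to each parameter and cruise would give one value per segment, month, and year.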

Water quality parameters

The pH was measured using a PHS-3TC instrument after calibration with standard buffer solutions. The DO and CODMn were measured by titration using a 10 ml micro burette, and concentration calibration was performed before measurement. TP, PO4-P, TN, NO3-N, NO2-N, and NH3-N were determined using a UV-2450PC spectrophotometer, and the quality of the results was controlled by analyzing standard samples alongside the water samples.

Phytoplankton abundance and biomass

Lugol’s solution (concentration of 1%) was used for fixation and preservation of each sample. The samples were then concentrated, and phytoplankton were classified and counted using a microscope (E200, Nikon Corporation, Japan) at a magnification of 400×. The identification of phytoplankton followed the method described in a previous study26. The biomass of phytoplankton was calculated from the cell counts and cell sizes of the different species.

Zooplankton abundance and biomass

Quantitative samples of cladocerans and copepods were collected using a column-shaped water sampler from the upper, middle, and lower layers of the water, with 10 liters collected from each layer. After thorough mixing, the samples were filtered through a 64 μm plankton net to collect the organisms. Finally, the collected organisms were fixed using a 4% neutral formaldehyde solution. For quantitative samples of rotifers, a mixed water sample of 1 liter was taken from the upper, middle, and lower layers. The sample was fixed using a 4% neutral formaldehyde solution and Lugol’s solution at a concentration of 1.5%. After 24 hours of settling, the sample was concentrated to 50 ml. The collected planktonic organisms were then classified, identified, and counted under a microscope at 40× magnification (Olympus Corporation, Tokyo, Japan). The biomass was calculated from the number and size of individual organisms of each zooplankton species. In cases of high population density, subsampling methods were used for counting27.

Data derived from satellite remote sensing

Aquatic vegetation

Three aquatic vegetation types (floating, emergent, and submerged vegetation) were extracted from Landsat Thematic Mapper (TM), Enhanced Thematic Mapper Plus (ETM+), and OLI data using the methods described in a previous study28. The surface reflectance data were used to calculate the emergent vegetation sensitivity index (EVSI), algae index (AI), normalized difference vegetation index (NDVI), and submerged vegetation sensitivity index (SVSI). Then, the emergent vegetation (EV), submerged vegetation (SV), and floating vegetation (FV) were classified following the decision tree described in ref. 28, and the threshold values were adjusted according to the true color composite image. One Landsat image acquired between May and September was selected to map the aquatic vegetation distribution in each year. The spatial resolution was 30 m, and the aquatic vegetation dataset ranged from 2000 to 2020.
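Of the four indices named above, only NDVI has a widely used standard definition, (NIR - Red) / (NIR + Red); EVSI, AI, and SVSI are defined in ref. 28 and are not reproduced here. The sketch below computes NDVI for one pixel. The function name and the sample reflectance values are hypothetical, and returning NaN when both bands are zero is a choice made for this sketch.

```python
import math


def ndvi(nir, red):
    """Normalized difference vegetation index from surface reflectance.

    Returns NaN when both bands are zero, where the ratio is undefined.
    """
    total = nir + red
    if total == 0:
        return math.nan
    return (nir - red) / total


# Made-up reflectances: a vegetated pixel and an open-water pixel.
vegetated = ndvi(nir=0.4, red=0.1)
open_water = ndvi(nir=0.02, red=0.05)
```

High positive values indicate dense green vegetation, while open water usually yields negative values, which is why such an index is useful as one input to a decision tree.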

Cyanobacterial blooms

The fractional floating algae cover (FAC) of Lake Taihu, representing the percentage of algal blooms within a pixel, was estimated using MODIS satellite data19. Level-1 MODIS Aqua data were downloaded from NASA GSFC (http://oceancolor.gsfc.nasa.gov) and processed to Rayleigh-corrected reflectance (Rrc) using SeaDAS (version 7.5.3). After eliminating the influence of clouds, sun glint, aquatic vegetation, and turbid water, the FAC of each pixel was calculated following Eq. 1 and Eq. 7 in our previous study19. The yearly mean FAC of each pixel in Lake Taihu was averaged from the daily FAC of the available images in each year. The yearly mean FAC dataset ranged from 2003 to 2022, and the spatial resolution was resampled to 250 m.
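The per-pixel yearly averaging can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: grids are plain nested lists, masked pixels (cloud, glint, vegetation, turbid water) are encoded as NaN, and a pixel is averaged only over the days on which it is valid. The function name `yearly_mean` and the sample grids are hypothetical.

```python
import math


def yearly_mean(daily_grids):
    """Per-pixel mean over equally shaped 2-D grids, skipping NaN pixels.

    A pixel that is masked in every image stays NaN in the result.
    """
    rows, cols = len(daily_grids[0]), len(daily_grids[0][0])
    result = [[math.nan] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            valid = [g[r][c] for g in daily_grids if not math.isnan(g[r][c])]
            if valid:
                result[r][c] = sum(valid) / len(valid)
    return result


# Three made-up 1 x 2 daily FAC grids; the second pixel is always masked.
nan = math.nan
daily = [[[0.2, nan]], [[0.4, nan]], [[nan, nan]]]
fac_mean = yearly_mean(daily)
```

Because the number of valid observations differs between pixels and years, users comparing years may also want to track the per-pixel count of valid images.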

SDD

The Landsat 5 TM, 7 ETM+, and 8 OLI surface reflectance images were downloaded from the Google Earth Engine platform and were processed to remote sensing reflectance (Rrs) after quality control. The relationship between Rrs(red) and SDD was determined to retrieve the SDD of Lake Taihu29. The reported mean relative error (MRE) between the measured and derived SDD values was 34.2% for various lakes in China. The yearly mean SDD of each pixel in Lake Taihu was calculated from the SDD values of the available images in each year. The yearly mean SDD dataset ranged from 1986 to 2020 with a spatial resolution of 30 m.

Chla

Level-1 cloud-free TM, ETM+, and OLI data were downloaded, and SeaDAS was used to calculate the Rrc for Lake Taihu. Based on 234 matched pairs of field and satellite data over Lake Taihu, an XGBoost machine learning model with a mean absolute percentage error (MAPE) of 35% was constructed17. This model enabled the generation of Chla distribution maps for Lake Taihu from 1984 to 2019, from which yearly mean Chla values were calculated. Note that areas covered by cyanobacterial blooms and aquatic vegetation were excluded when this dataset was produced.
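For readers unfamiliar with the metric, MAPE is commonly defined as the mean of |predicted - observed| / observed, expressed in percent. The sketch below uses that common definition; whether ref. 17 uses exactly this form is an assumption, and the function name and sample values are hypothetical.

```python
def mape(observed, predicted):
    """Mean absolute percentage error (%), using the common definition.

    Observed values must be non-zero because they appear in the denominator.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    ratios = [abs(p - o) / abs(o) for o, p in zip(observed, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Made-up Chla match-ups: field-measured vs satellite-derived values.
error = mape(observed=[10.0, 20.0], predicted=[12.0, 18.0])
```

With this definition, a MAPE of 35% means that derived values deviate from field values by about one third of the measured value on average.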

TSI

The trophic state index (TSI) inversion algorithm based on the algal biomass index (ABI) was designed to use the surface reflectance from Landsat series data30. After quality control, the algorithm was applied to cloud-free TM, ETM+, and OLI images. Then, the TSI values of each image from 1986 to 2020 were derived for Lake Taihu, and yearly mean TSI images were obtained from all the available images within each year.

Climate data

The daily water level was monitored at TH Station starting from 2004 based on the Wusong elevation benchmark. Climate variations from 1956 to 2020 were documented from daily meteorological data from the DS station (Fig. 1). The meteorological data, including daily records of wind speed (m/s), wind direction at maximum wind speed (°), air temperature (°C), and precipitation (mm), were obtained from the China Meteorological Data Sharing Service System (http://data.cma.cn/). Daily wind speed and temperature were the mean values of four observations at 02:00, 08:00, 14:00, and 20:00 (UTC + 8).
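The daily averaging of the four fixed-hour observations can be sketched as follows. The function name and sample values are hypothetical, and raising an error when an observation is missing is a choice made for this sketch; the source does not state how incomplete days were handled.

```python
OBSERVATION_HOURS = (2, 8, 14, 20)  # local time, UTC + 8


def daily_mean(observations):
    """Mean of the four fixed-hour observations.

    `observations` maps hour of day to the measured value.
    """
    missing = [h for h in OBSERVATION_HOURS if h not in observations]
    if missing:
        raise ValueError(f"missing observations for hours: {missing}")
    return sum(observations[h] for h in OBSERVATION_HOURS) / len(OBSERVATION_HOURS)


# Made-up air temperatures (degrees C) for one day.
temperature = daily_mean({2: 18.0, 8: 20.0, 14: 26.0, 20: 22.0})
```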

Anthropogenic data

Human activities in the lake basin exert anthropogenic pressures on the eco-environment and trophic state of the lake. Land cover, population density (POPD), and nighttime light (NTL) were taken as the anthropogenic data. These data are stored as tiff rasters covering the Lake Taihu Basin (Fig. 1b).

Land cover

The land cover dataset for the Lake Taihu Basin was extracted from the published Landsat-derived annual China land cover dataset31, covering 1990–2020. In that dataset, a random forest classifier using various information, including spectral indices, temporal statistics, topography, and location, was used to generate nine major land cover types: cropland, forest, shrub, grassland, water, snow and ice, barren, impervious, and wetland. The spatial resolution of the annual land cover dataset is 30 m.

Population density

The population density from 2000 to 2020 in the Lake Taihu Basin was extracted from the WorldPop dataset32. The source units are the number of people per pixel, with country totals adjusted to match the corresponding official United Nations population estimates prepared by the Population Division of the Department of Economic and Social Affairs of the United Nations Secretariat (https://doi.org/10.5258/SOTON/WP00660). The population density dataset was derived from the population count dataset by dividing the number of people in each pixel by the pixel area. The spatial resolution was 100 m.
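The count-to-density conversion can be sketched as below. Several points are assumptions not stated in the text: that the count grid is defined in geographic coordinates with a cell size of about 3 arc-seconds (roughly 100 m), that a spherical Earth is an adequate approximation, and that density is expressed per km². Function names and the sample count are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean radius; spherical approximation (assumption)


def pixel_area_km2(lat_south_deg, cell_size_deg):
    """Area of a lat/lon cell whose southern edge lies at `lat_south_deg`."""
    lat1 = math.radians(lat_south_deg)
    lat2 = math.radians(lat_south_deg + cell_size_deg)
    dlon = math.radians(cell_size_deg)
    return EARTH_RADIUS_KM ** 2 * dlon * (math.sin(lat2) - math.sin(lat1))


def population_density(count, lat_south_deg, cell_size_deg=3 / 3600):
    """People per km2 for one pixel: people in the pixel / pixel area."""
    return count / pixel_area_km2(lat_south_deg, cell_size_deg)


# A made-up pixel near Lake Taihu (about 31.2 degrees N) holding 10 people.
area = pixel_area_km2(31.2, 3 / 3600)
density = population_density(10.0, 31.2)
```

On a geographic grid the pixel area shrinks with latitude, so dividing by a constant 0.01 km² would bias the density; if the grid were projected to equal-area 100 m cells, a constant area would be appropriate instead.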

Nighttime light intensity

Nighttime light data over the Lake Taihu Basin were extracted from the NPP (Suomi National Polar-orbiting Partnership)-VIIRS (Visible Infrared Imaging Radiometer Suite)-like NTL dataset33. In that dataset, an extended time series (2000–2018) of NPP-VIIRS-like NTL data was built through a new cross-sensor calibration from DMSP-OLS (Defense Meteorological Satellite Program Operational Linescan System) NTL data (2000–2012) and a composition of monthly NPP-VIIRS NTL data (2013–2018). The yearly mean NTL intensities from 2000 to 2020 were downloaded from https://doi.org/10.7910/DVN/YGIVCD, and the spatial resolution is 15 arcsec (~500 m near the equator).

Data Records

The THQBCA dataset34 is available in two data formats: tables and tiff raster layers. The water quality parameters and climate data are stored in Microsoft Excel xlsx files. The bio-optics and anthropogenic data are stored in tiff format. Table 1 describes the details of the variables, including source, spatial resolution, temporal resolution, and units. Table 2 lists the folder, filename, and format of each variable in the dataset.

Table 1 Summary of variables for Lake Taihu reported in this study.
Table 2 Description of the data records.

Water quality parameters

The water quality parameters for different regions and months in Lake Taihu are stored in xlsx format in the folder ‘1.WaterQuality’, in a file named ‘1.WaterQuality.xlsx’. For each variable, the mean values in each subregion and for the whole lake are listed in the xlsx file (Table 2).

Bio-optical parameters

The bio-optical parameters are stored in tiff format in the folder ‘2.Bio-optics’. Aquatic vegetation rasters from 2000 to 2020 are saved in the folder ‘2.1AquaticVegetation’; an example for 2020 is shown in Fig. 2a. Raster data for the other four bio-optical parameters (FAC, Chla, SDD, and TSI) are saved in separate folders, with files named ‘TH_ParameterName_Year.tif’ (Fig. 2b–e).
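A small helper for the ‘TH_ParameterName_Year.tif’ convention might look like the sketch below. The helper names are hypothetical, and the parameter strings actually used in the archive (for example whether the FAC files use "FAC") should be checked against Table 2; the pattern assumes letters only in the parameter name and a four-digit year.

```python
import re

# Pattern for the 'TH_ParameterName_Year.tif' convention (assumed form).
FILENAME_PATTERN = re.compile(r"^TH_(?P<parameter>[A-Za-z]+)_(?P<year>\d{4})\.tif$")


def build_raster_name(parameter, year):
    """Compose a file name that follows the naming convention."""
    return f"TH_{parameter}_{year}.tif"


def parse_raster_name(filename):
    """Return (parameter, year) for a matching file name, otherwise None."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    return match.group("parameter"), int(match.group("year"))


# 'FAC' and 2010 are illustrative values only.
example = parse_raster_name(build_raster_name("FAC", 2010))
```

Parsing the year from the file name makes it straightforward to sort the annual rasters into a time series before analysis.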

Fig. 2 Examples of (a) aquatic vegetation, (b) FAC, (c) SDD, (d) Chla, and (e) TSI in Lake Taihu, and (f) land cover, (g) population density, and (h) NTL intensity data of the Lake Taihu Basin. Note that the ET region, where aquatic vegetation is frequent, was masked in the FAC, SDD, Chla, and TSI data.

Climate data

The climate data for Lake Taihu are stored in xlsx format in the folder ‘3.Climate’, in a file named ‘3.Climate.xlsx’. The daily mean values of PRE, TEM, WIN, and water level are saved in different sheets.

Anthropogenic data

The land cover data for the Lake Taihu Basin are saved in tiff format in the folder ‘4.Anthropogenic/4.1Landcover’. The yearly mean population density and NTL intensity in the Lake Taihu Basin are saved in ‘4.Anthropogenic/4.2POPD’ and ‘4.Anthropogenic/4.3NTL’, respectively. Examples of land cover, POPD, and NTL intensity are mapped in Fig. 2f–h.

Technical Validation

The results of the accuracy evaluation of the parameters derived from satellite data are summarized in Table 3. Because several variables were produced at the national scale, the overall accuracy of the source data is reported, and local validation is also analyzed in this section.

Table 3 Accuracy of the source data derived from satellite data.

Validation of cyanobacterial blooms using OLI data

We validated the accuracy of FAC indirectly using randomly selected match-up pairs of MODIS and OLI images. The equivalent bloom area (EBA) derived from the MODIS FAC was compared with the bloom area (BA) derived from the OLI images, which have a higher spatial resolution of 30 m. For example, on May 20, 2020, the OLI-derived algal bloom area was 185.9 km² at 10:30 (UTC + 8), and the EBA was 147.4 km² at 13:10 (UTC + 8) (Fig. 3a–d). Although there was a time gap between the two images in each pair, the two estimates showed good agreement, with an R² of 0.61 (N = 60) (Fig. 3e).
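The comparison can be sketched as follows. Two assumptions are made that the text does not confirm: the EBA is taken as the sum of per-pixel FAC (as a fraction between 0 and 1) multiplied by the pixel area (0.0625 km² for a 250 m pixel), and R² is taken as the squared Pearson correlation between the paired areas. Function names and sample values are hypothetical.

```python
def equivalent_bloom_area(fac_values, pixel_area_km2=0.0625):
    """Bloom area (km2) as the sum of fractional cover times pixel area.

    `fac_values` are fractions in [0, 1]; 0.0625 km2 is a 250 m pixel.
    """
    return sum(fac_values) * pixel_area_km2


def r_squared(x, y):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Made-up FAC values for four pixels and made-up paired areas (km2).
eba = equivalent_bloom_area([1.0, 0.5, 0.0, 0.5])
fit = r_squared([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Since blooms drift and change within hours, part of the scatter in such a comparison reflects the acquisition time gap rather than retrieval error.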

Fig. 3 (a–d) Comparison between OLI and MODIS observations of blooms on May 20, 2020. (e) Relationship between the EBA of MODIS and the BA of the OLI.

Validation of aquatic vegetation using Sentinel-2 MSI data

The aquatic vegetation derived from Landsat was validated by selecting points of different water types in matched pairs of Sentinel-2 Multispectral Instrument (MSI) and OLI images. The confusion matrices between the validated class from MSI and the mapped class derived from the algorithm were calculated, and the overall accuracy (OA), Kappa coefficient, user accuracy (UA), and producer accuracy (PA) were obtained (Table 4). The OA was 93.36%, and the Kappa coefficient was 0.87.
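The four accuracy measures follow from a confusion matrix using their standard definitions, sketched below. The matrix orientation (rows as mapped classes, columns as reference classes) is an assumption of this sketch, and the 2 x 2 example matrix is made up; it is not taken from Table 4.

```python
def accuracy_metrics(matrix):
    """OA, Kappa, UA and PA from a square confusion matrix.

    Rows are mapped (predicted) classes; columns are reference classes.
    Every class must occur at least once in both rows and columns.
    """
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    diagonal = [matrix[i][i] for i in range(n_classes)]
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[r][c] for r in range(n_classes)) for c in range(n_classes)]

    overall = sum(diagonal) / total
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (overall - expected) / (1 - expected)
    user = [d / r for d, r in zip(diagonal, row_totals)]
    producer = [d / c for d, c in zip(diagonal, col_totals)]
    return {"OA": overall, "Kappa": kappa, "UA": user, "PA": producer}


# Made-up two-class example (for instance vegetation vs open water).
metrics = accuracy_metrics([[45, 5], [10, 40]])
```

UA describes how reliable a mapped class is for the map user, while PA describes how completely a reference class was captured.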

Table 4 Confusion matrix between the validated class from MSI and the predicted class derived from OLI in Lake Taihu.

Cross-validation of the wind speed and temperature

The wind speed and temperature monitored by the TH station were also compared with the wind speed and temperature of the ERA5 (fifth generation ECMWF atmospheric reanalysis of the global climate) dataset (https://www.ecmwf.int/en/forecasts/dataset/ecmwf-reanalysis-v5) (Fig. 4). The hourly wind speed and temperature data for Lake Taihu were downloaded from the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA5 dataset, which contains the latest climate reanalysis data. The hourly ERA5 wind speed was averaged to the daily mean wind speed and then compared with the field-measured daily averaged wind speed. The two wind speed records were in good agreement, with R² = 0.65, although the ERA5 wind speed was greater than that measured at the TH station. The temperature records agreed closely, with an R² of 0.99, and most of the validation points lay near the 1:1 line.
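Aggregating hourly reanalysis values to daily means can be sketched as below. ERA5 time stamps are in UTC, so this sketch shifts them to UTC + 8 before grouping by calendar day; whether the authors aligned the days this way is not stated and is an assumption here. The function name and sample series are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

LOCAL_TZ = timezone(timedelta(hours=8))  # UTC + 8, as used for the field data


def daily_means(hourly):
    """Group (timezone-aware datetime, value) pairs by local date and average."""
    buckets = defaultdict(list)
    for stamp, value in hourly:
        buckets[stamp.astimezone(LOCAL_TZ).date()].append(value)
    return {day: sum(values) / len(values) for day, values in buckets.items()}


# Made-up hourly wind speeds covering local day 2020-05-20 (00:00 to 23:00).
start = datetime(2020, 5, 19, 16, tzinfo=timezone.utc)
hourly = [(start + timedelta(hours=h), float(h)) for h in range(24)]
era5_daily = daily_means(hourly)
```

The resulting daily means can then be paired with the station record by date before computing R².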

Fig. 4 Cross-validation of (a) wind speed (W, m/s) and (b) temperature (T, °C) with field-measured data and the ERA5 dataset.

Usage Notes

The THQBCA dataset can be freely accessed at https://doi.org/10.5281/zenodo.13917285, where it is stored as a zip file (881 MB). After the zip file is uncompressed, the bio-optical parameters and anthropogenic data are available in GeoTIFF format (2.64 GB). These data can be processed using open-source software such as QGIS. The field-measured data (water quality, phytoplankton, zooplankton, and climate data) are saved as Microsoft Excel XLSX files.