Disclosure of Invention
In order to solve the above technical problems, the present invention provides a sound scene prediction model training method and a sound scene prediction method that fuse multi-source city data, thereby addressing the limitation of the prior art in measuring sound scenes.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
In a first aspect, the present invention provides a method for training a sound scene prediction model fused with multi-source city data, including:
acquiring stored environmental data in a training geographic area, and performing statistical analysis on the environmental data to obtain sound scene training features of the training geographic area;
applying a neural network model to the sound scene training features to obtain sound scene training labels output by the neural network model for the training geographic area;
obtaining a measured sound scene label of the training geographic area, calculating a loss function of the neural network model from the measured sound scene label and the sound scene training labels, and training the neural network model according to the loss function to obtain a sound scene prediction model.
In one implementation, acquiring the stored environmental data in the training geographic area and performing statistical analysis on the environmental data to obtain the sound scene training features of the training geographic area includes:
acquiring point of interest data information, road data, building data, street view images, and greening data from the stored environmental data in the training geographic area;
performing statistical analysis on the point of interest data information, the road data, the building data, the street view images, and the greening data to obtain, within the training geographic area, the number of points of interest, the point of interest density, the road density, the shortest distance from the center position to a road, the building density, the floor area ratio, the green space coverage, the nearest green space distance, the nearest green space area, and the street view image element ratios;
taking the number of points of interest, the point of interest density, the road density, the shortest distance, the building density, the floor area ratio, the green space coverage, the nearest green space distance, the nearest green space area, and the street view image element ratios as the sound scene training features of the training geographic area.
In one implementation, the neural network model includes a feature screening layer, a shared hidden layer, a sound intensity hidden layer, and a sound source hidden layer. The feature screening layer is cascaded with the shared hidden layer, the sound intensity hidden layer, and the sound source hidden layer, respectively, and the shared hidden layer is cascaded with the sound intensity hidden layer and the sound source hidden layer. Applying the neural network model to the sound scene training features to obtain the sound scene training labels output by the neural network model for the training geographic area includes:
inputting the sound scene training features into the feature screening layer to obtain shared features, sound intensity features, and sound source features screened by the feature screening layer from the sound scene training features, where the shared features are sound scene features related to both sound intensity and sound source, the sound intensity features are sound scene features related only to sound intensity, and the sound source features are sound scene features related only to sound source;
after the shared features are input to the shared hidden layer, inputting the sound intensity features to the sound intensity hidden layer and the sound source features to the sound source hidden layer, obtaining a sound intensity training label output by the sound intensity hidden layer and a sound source training label output by the sound source hidden layer, and taking the sound intensity training label and the sound source training label as the sound scene training labels.
In one implementation, obtaining the measured sound scene label of the training geographic area, calculating the loss function of the neural network model from the measured sound scene label and the sound scene training labels, and training the neural network model according to the loss function to obtain the sound scene prediction model includes:
acquiring a measured sound intensity label and a measured sound source label from the measured sound scene label;
calculating a sound intensity loss function from the difference between the sound intensity training label and the measured sound intensity label;
calculating a sound source loss function from the sound source training label and the measured sound source label;
adjusting parameters of the neural network model according to the sound intensity loss function and the sound source loss function until both loss functions are smaller than set values, so as to obtain the sound scene prediction model.
In one implementation, calculating the sound intensity loss function from the difference between the sound intensity training label and the measured sound intensity label includes:
calculating the mean square error between the sound intensity training label and the measured sound intensity label, and taking this sound intensity mean square error as the sound intensity loss function.
In one implementation, calculating the sound source loss function from the sound source training label and the measured sound source label includes:
calculating the logarithm of each sound source training label;
multiplying each logarithm by the corresponding measured sound source label to obtain intermediate results;
performing a weighted summation of the intermediate results to obtain the sound source loss function.
In a second aspect, an embodiment of the present invention further provides a sound scene prediction method that applies the above sound scene prediction model, the sound scene prediction method including:
acquiring environmental data in a geographic area to be predicted, and performing statistical analysis on the environmental data to obtain sound scene features of the geographic area to be predicted;
applying the sound scene prediction model to the sound scene features to obtain a sound source prediction label and a sound intensity prediction label output by the sound scene prediction model.
In a third aspect, an embodiment of the present invention further provides a training device for a sound scene prediction model fused with multi-source city data, the training device including the following components:
a feature statistics module, configured to acquire the stored environmental data in the training geographic area and perform statistical analysis on the environmental data to obtain the sound scene training features of the training geographic area;
a prediction module, configured to apply the neural network model to the sound scene training features to obtain the sound scene training labels output by the neural network model for the training geographic area;
a training module, configured to obtain the measured sound scene label of the training geographic area, calculate the loss function of the neural network model from the measured sound scene label and the sound scene training labels, and train the neural network model according to the loss function to obtain the sound scene prediction model.
In a fourth aspect, an embodiment of the present invention further provides a terminal device, the terminal device including a memory, a processor, and a sound scene prediction model training program fusing multi-source city data that is stored in the memory and executable on the processor; when the processor executes the sound scene prediction model training program fusing multi-source city data, the steps of the above sound scene prediction model training method fusing multi-source city data are implemented.
In a fifth aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a sound scene prediction model training program fusing multi-source city data is stored; when the sound scene prediction model training program fusing multi-source city data is executed by a processor, the steps of the above sound scene prediction model training method fusing multi-source city data are implemented.
The beneficial effects are as follows: according to the present invention, the environmental data are statistically analysed to obtain features related to the sound scene, namely the sound scene training features, and the neural network model is trained on these features to obtain a sound scene prediction model. Environmental data of a geographic area whose sound scene needs to be predicted are then acquired, the sound scene prediction model is applied to them, and the model outputs the sound scene information of that area. Because the environmental data are existing data, the sound scene information can be predicted without on-site acoustic acquisition, which broadens the application scenarios of the sound scene prediction method. In addition, because the present invention uses a sound scene prediction model to predict the sound scene information, the sound scene can be predicted comprehensively and accurately.
Detailed Description
The technical solutions of the present invention are described clearly and completely below with reference to the embodiments and the drawings. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Research shows that a sound scene is the acoustic environment as perceived, experienced, or understood by individuals or groups, and it covers all sounds in the environment. Accurate knowledge of the sound scene of an environment can provide technical support for improving urban soundscape quality and protecting residents' health. The prior art typically uses acoustic measurement instruments to capture the sound scene of an environment, but such instruments have geographic limitations: they cannot be placed in every geographic area to measure the sound scene there.
In order to solve these technical problems, the present invention provides a sound scene prediction model training method and a sound scene prediction method that fuse multi-source city data, addressing the limitation of the prior art in measuring sound scenes. In a specific implementation, stored environmental data in a training geographic area are first acquired and statistically analysed to obtain sound scene training features of the training geographic area; a neural network model is then applied to the sound scene training features to obtain sound scene training labels output by the neural network model for the training geographic area; finally, a measured sound scene label of the training geographic area is obtained, a loss function of the neural network model is calculated from the measured sound scene label and the sound scene training labels, and the neural network model is trained according to the loss function to obtain a sound scene prediction model.
The method for training a sound scene prediction model fusing multi-source city data can be applied to a terminal device, and the terminal device can be a terminal product with a data acquisition function, such as a computer. In this embodiment, as shown in fig. 1, the method for training a sound scene prediction model fusing multi-source city data specifically includes the following steps:
S100, acquiring stored environmental data in a training geographic area, and performing statistical analysis on the environmental data to obtain sound scene training features of the training geographic area.
S200, applying a neural network model to the sound scene training features to obtain sound scene training labels output by the neural network model for the training geographic area.
S300, obtaining a measured sound scene label of the training geographic area, calculating a loss function of the neural network model from the measured sound scene label and the sound scene training labels, and training the neural network model according to the loss function to obtain a sound scene prediction model.
In one embodiment, step S100 includes the following specific steps S101 and S102:
S101, acquiring point of interest data information, road data, building data, street view images, and greening data from the stored environmental data in the training geographic area.
The point of interest data information, road data, building data, street view images, and greening data in fig. 2 constitute the multi-source city data.
The point of interest data information is acquired through the Gaode (Amap) open platform and the Baidu Map open platform and includes the point of interest (POI) data and area of interest (AOI) data in fig. 2; the POIs include, for example, malls, schools, and stations, and the AOI data represent the density or area of the POIs.
The OSM road and building data are downloaded from the OpenStreetMap website, the street view images are obtained from the Baidu Map panorama platform, and the urban green space areas are extracted by classifying Sentinel-2 high-resolution remote sensing images.
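As a minimal sketch, not part of the claimed method, the multi-source data could be loaded with common open-source geospatial tools; the file names, layers, and coordinate reference system below are assumptions introduced only for illustration.

```python
# Illustrative loading of the multi-source city data; paths and the CRS are assumptions.
import geopandas as gpd

roads = gpd.read_file("osm_roads.geojson").to_crs(epsg=32650)        # OSM road data, projected for metric lengths
buildings = gpd.read_file("osm_buildings.geojson").to_crs(epsg=32650)  # OSM building data
pois = gpd.read_file("poi.geojson").to_crs(epsg=32650)               # POI data from the map open platforms
aois = gpd.read_file("aoi.geojson").to_crs(epsg=32650)               # AOI data from the map open platforms
green = gpd.read_file("green_space.geojson").to_crs(epsg=32650)      # green space classified from Sentinel-2 imagery
```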
S102, performing statistical analysis on the point of interest data information, the road data, the building data, the street view images, and the greening data to obtain, within the training geographic area, the number of points of interest, the point of interest density, the road density, the shortest distance from the center position to a road, the building density, the floor area ratio, the green space coverage, the nearest green space distance, the nearest green space area, and the street view image element ratios. The number of points of interest, the point of interest density, the road density, the shortest distance, the building density, the floor area ratio, the green space coverage, the nearest green space distance, the nearest green space area, and the street view image element ratios are taken as the sound scene training features of the training geographic area. The individual features are defined as follows, and an illustrative computation sketch is given after the definitions.
Number of points of interest: P_i is the number of points of interest of the i-th category within the grid of the training geographic area; for example, P_i may be the number of schools or of shopping malls in the training geographic area.
Point of interest density: the sum of the areas of the points of interest divided by the total area of the training geographic area, i.e. Σ_i A_i / S, where A_i is the area of the i-th point of interest in the training geographic area and S is the total area of the training geographic area.
Road density: D_R = L / S, where L is the total length of the roads within the training geographic area.
Shortest distance to a road: min_j dist(c, r_i,j), where r_i,j is the j-th road of the i-th road level and dist(c, r_i,j) is the distance from the center point c of the training geographic area to r_i,j.
Building density: D_B = (Σ_{i=1..n} a_b,i) / S, where n is the total number of buildings in the training geographic area and a_b,i is the base area of the i-th building.
Floor area ratio: FAR = (Σ_{i=1..n} a_f,i) / S, where a_f,i is the floor area of the i-th building.
Green space coverage: FVC = (NDVI - NDVI_soil) / (NDVI_veg - NDVI_soil), where NDVI is the normalized difference vegetation index of the pixel, NDVI_soil is the NDVI value of a pure bare-soil pixel, and NDVI_veg is the NDVI value of a pure vegetation pixel.
Nearest green space distance: Dist_G = min_i dist(c, g_i), where g_i is the i-th green space outside the training geographic area and dist(c, g_i) is the distance from the center point c of the training geographic area to g_i.
Nearest green space area A_g: the area of the green space corresponding to the nearest green space distance Dist_G from the center point c of the training geographic area.
Street view image element ratio: T_i / T_n for the i-th element class, where T_i is the area occupied by pixels of the i-th class in the street view image and T_n is the total area of the street view image, i.e. the sum of all its pixels.
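The following is a minimal sketch, under stated assumptions, of how the statistical analysis of step S102 could be carried out with geopandas for one grid cell. The GeoDataFrame variables follow the loading sketch above; the grid cell is assumed to be a shapely polygon, and the column name floor_area and the street view pixel-count dictionary are assumptions, not part of the disclosure.

```python
# Illustrative feature computation for one grid cell of the training geographic area.
def grid_features(cell, pois, aois, roads, buildings, green, sv_pixel_counts):
    S = cell.area                                     # total area S of the grid cell
    c = cell.centroid                                 # center point c
    feats = {}
    feats["poi_count"] = int(pois.within(cell).sum())                   # number of points of interest
    feats["poi_density"] = aois[aois.intersects(cell)].area.sum() / S   # sum of AOI areas / S
    feats["road_density"] = roads.intersection(cell).length.sum() / S   # D_R = L / S
    feats["dist_nearest_road"] = roads.distance(c).min()                # min_j dist(c, r_i,j)
    blds = buildings[buildings.intersects(cell)]
    feats["building_density"] = blds.area.sum() / S                     # D_B
    feats["far"] = blds["floor_area"].sum() / S                         # floor area ratio (assumed column)
    feats["dist_nearest_green"] = green.distance(c).min()               # Dist_G
    feats["nearest_green_area"] = green.iloc[green.distance(c).argmin()].geometry.area  # A_g
    total_px = sum(sv_pixel_counts.values())
    for cls, px in sv_pixel_counts.items():           # street view element ratios T_i / T_n
        feats[f"sv_ratio_{cls}"] = px / total_px
    return feats
```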
In one embodiment, the neural network model in S200 is shown in fig. 2 and includes a feature screening layer, a shared hidden layer, a sound intensity hidden layer, and a sound source hidden layer. The feature screening layer is cascaded with the shared hidden layer, the sound intensity hidden layer, and the sound source hidden layer, and the shared hidden layer is cascaded with the sound intensity hidden layer and the sound source hidden layer, respectively. The feature screening layer is the feature engineering layer in fig. 2. Step S200 includes the following specific steps S201 and S202:
S201, inputting the sound scene training features into the feature screening layer to obtain shared features, sound intensity features, and sound source features screened by the feature screening layer from the sound scene training features, where the shared features are sound scene features related to both sound intensity and sound source, the sound intensity features are sound scene features related only to sound intensity, and the sound source features are sound scene features related only to sound source.
The feature screening layer screens and classifies the sound scene features obtained in step S102 into sound source features, sound intensity features, and shared features, as illustrated by the grouping sketch below. The shared features include the road density D_R from step S102, the shortest distance to a secondary road, the green space coverage FVC, the transit station density, and the scenic spot density; the sound intensity features include the nearest green space area A_g and the street view image element ratios; the sound source features include the shortest distance to a primary road, the shortest distance to a tertiary road, the shortest distance to a quaternary road, the building density D_B, the floor area ratio FAR, the point of interest density, and the number of points of interest.
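Purely as an illustration of the output of this screening step, the three feature groups could be represented as follows; the feature names (and the street view element classes) are assumptions that mirror the computation sketch above.

```python
# Illustrative grouping of the screened sound scene features (names are assumptions).
feature_groups = {
    "shared":    ["road_density", "dist_secondary_road", "green_coverage_fvc",
                  "transit_station_density", "scenic_spot_density"],
    "intensity": ["nearest_green_area", "sv_ratio_sky", "sv_ratio_tree",
                  "sv_ratio_building", "sv_ratio_road"],
    "source":    ["dist_primary_road", "dist_tertiary_road", "dist_quaternary_road",
                  "building_density", "far", "poi_density", "poi_count"],
}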
S202, after the shared features are input to the shared hidden layer, inputting the sound intensity features to the sound intensity hidden layer and the sound source features to the sound source hidden layer, obtaining the sound intensity training label output by the sound intensity hidden layer and the sound source training label output by the sound source hidden layer, and taking the sound intensity training label and the sound source training label as the sound scene training labels.
That is, the shared features are input first; once they have reached the shared hidden layer, the sound source features and the sound intensity features are input to the sound source hidden layer and the sound intensity hidden layer, respectively, so that the three groups of features are fed into the neural network model in stages.
The three hidden layers, namely the shared hidden layer, the sound intensity hidden layer, and the sound source hidden layer, are all composed of convolutional layers. The shared hidden layer receives the screened features; for the features required by both the sound source prediction task and the sound intensity prediction task, the shared hidden layer (a fully connected layer) is designed to extract the features common to the two tasks. The sound intensity hidden layer and the sound source hidden layer are task-specific feature layers: two separate hidden layers (fully connected layers) are designed after the shared hidden layer for the task-specific features, so that the model learns the features in which the two tasks differ.
Because the training sample size is small and several fully connected layers easily cause overfitting, a Drop-out layer is added after each of the sound source hidden layer and the sound intensity hidden layer to enhance the generalization of the model. A softmax activation function is used in the output layer cascaded with the sound source hidden layer, and a linear activation function is used in the output layer cascaded with the sound intensity hidden layer.
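The following is a minimal sketch of this multi-task architecture in PyTorch, not the claimed implementation: it assumes fully connected layers throughout (the embodiment mentions both convolutional and fully connected layers), an illustrative hidden width and Drop-out rate, and the four sound source classes of Table 1 below.

```python
# Minimal sketch of the shared/branch architecture; layer widths and dropout rate are assumptions.
import torch
import torch.nn as nn

class SoundScenePredictor(nn.Module):
    def __init__(self, n_shared_feat, n_intensity_feat, n_source_feat,
                 n_classes=4, hidden=64, dropout=0.5):
        super().__init__()
        # shared hidden layer: extracts features common to both tasks
        self.shared = nn.Sequential(nn.Linear(n_shared_feat, hidden), nn.ReLU())
        # sound intensity branch: task-specific hidden layer + Drop-out + linear output
        self.intensity_hidden = nn.Sequential(
            nn.Linear(hidden + n_intensity_feat, hidden), nn.ReLU(), nn.Dropout(dropout))
        self.intensity_out = nn.Linear(hidden, 1)          # linear activation
        # sound source branch: task-specific hidden layer + Drop-out + softmax output
        self.source_hidden = nn.Sequential(
            nn.Linear(hidden + n_source_feat, hidden), nn.ReLU(), nn.Dropout(dropout))
        self.source_out = nn.Linear(hidden, n_classes)

    def forward(self, x_shared, x_intensity, x_source):
        s = self.shared(x_shared)
        intensity = self.intensity_out(
            self.intensity_hidden(torch.cat([s, x_intensity], dim=1))).squeeze(1)
        source_prob = torch.softmax(
            self.source_out(self.source_hidden(torch.cat([s, x_source], dim=1))), dim=1)
        return intensity, source_prob
```

Concatenating the shared-layer output with each group of task-specific features is one straightforward way to realize the staged input described above; the disclosure does not prescribe this exact wiring.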
The sound intensity training label output by the neural network model is shown in fig. 3; that is, the neural network model predicts a sound intensity, and this predicted value serves as the sound intensity training label. The neural network model also outputs a sound source training label, i.e. the sound source category represented by a numeric label; the sound source categories are listed in Table 1.
TABLE 1
Sound source class | Detailed category
Natural sound      | Bird song, sound of waves, insect chirping, rustling of leaves in the wind
Human voice        | Talking, activity sounds, broadcasting, singing
Traffic sound      | Motor vehicle running sound and horns
Mechanical sound   | Construction sound, cargo handling sound
In one embodiment, step S300 includes the following specific steps S301 to S306:
S301, acquiring the measured sound intensity label and the measured sound source label from the measured sound scene label.
The measured sound intensity label and the measured sound source label are the sound intensity and the sound source category of the training geographic area, which are acquired as follows:
the sound level meter is pointed at the sound source, the longitude and latitude of the sampling point, the sampling point number, and the place where the sampling point is located are recorded, and the data are recorded and synchronously presented by means of an application program on the mobile device associated with the sound level meter.
The measured sound intensity label is the mean of the sound intensities measured at the sampling points.
S302, calculating the root mean square error between the sound intensity training labels ŷ_i and the measured sound intensity labels y_i, and using it as the sound intensity loss function:
RMSE = sqrt( (1/n) · Σ_{i=1..n} (ŷ_i - y_i)² ),
where n is the total number of sampling points.
S303, calculating the logarithm log(p_i) of each sound source training label p_i.
S304, multiplying each logarithm by the corresponding measured sound source label y_i to obtain the intermediate results y_i·log(p_i).
S305, performing a weighted summation of the intermediate results to obtain the sound source loss function H(y, p), i.e. the weighted cross-entropy H(y, p) = -Σ_i w_i · y_i · log(p_i), where w_i is the weight assigned to the i-th sound source category.
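A sketch of the two loss functions follows, assuming the network outputs class probabilities (as in the architecture sketch above) and that the weighted calculation of step S305 is realized by an optional per-class weight vector; the weight values themselves are not specified by the disclosure.

```python
# Illustrative loss functions for the two tasks.
import torch

def intensity_loss(pred, target):
    """Sound intensity loss: RMSE between predicted and measured intensity (S302)."""
    return torch.sqrt(torch.mean((pred - target) ** 2))

def source_loss(prob, target_onehot, class_weights=None):
    """Sound source loss: (weighted) cross-entropy built from y_i * log(p_i), steps S303-S305.
    `class_weights` is an assumption; the disclosure only states that the intermediate
    results are combined by a weighted summation."""
    log_p = torch.log(prob.clamp_min(1e-12))     # S303: logarithm of each training label
    terms = target_onehot * log_p                # S304: multiply by the measured label
    if class_weights is not None:
        terms = terms * class_weights            # S305: per-class weighting
    return -terms.sum(dim=1).mean()
```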
S306, adjusting the parameters of the neural network model according to the sound intensity loss function and the sound source loss function until both loss functions are smaller than the set values, so as to obtain the sound scene prediction model.
During model training, the loss functions of the sound intensity prediction task and the sound source prediction task are kept separate, the minimum of each is estimated by back-propagation, and training is complete when both loss functions reach their minima.
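A sketch of one training step consistent with this description is given below; back-propagating the two losses separately accumulates their gradients in the shared layer. The optimizer, learning rate, epoch count, and stopping thresholds (the set values of S306) are assumptions.

```python
# Illustrative training loop; model, intensity_loss and source_loss follow the sketches above,
# and x_shared, x_int, x_src, y_int, y_src_onehot stand for one batch of training data.
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(500):                              # number of epochs is an assumption
    optimizer.zero_grad()
    pred_int, pred_src = model(x_shared, x_int, x_src)
    loss_int = intensity_loss(pred_int, y_int)
    loss_src = source_loss(pred_src, y_src_onehot)
    loss_int.backward(retain_graph=True)              # back-propagate each loss separately
    loss_src.backward()
    optimizer.step()
    if loss_int.item() < 0.1 and loss_src.item() < 0.1:   # illustrative set values (S306)
        break
```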
In one embodiment, after model training is completed, the sound scene prediction model obtained by training is evaluated. Specifically, the sound source prediction task of the model is evaluated with the accuracy Acc, precision PREC, recall REC, and F1 functions, and the sound intensity prediction task of the model is evaluated with the R² function:
Acc = (TP + TN) / (TP + TN + FP + FN), PREC = TP / (TP + FP), REC = TP / (TP + FN), F1 = 2·PREC·REC / (PREC + REC),
where TP is the number of samples for which the sound source prediction label output by the model agrees with the measured sound source label for the class, FP is the number of samples the model predicts as the class although the measured label is a different class, FN is the number of samples of the class that the model predicts as a different class, and TN is the number of samples for which neither the prediction label nor the measured label is the class.
R² = 1 - Σ_i (y_i - ŷ_i)² / Σ_i (y_i - ȳ)², where ȳ is the mean of the measured sound intensity labels.
The accuracy Acc, precision PREC, recall REC, F1, R², and RMSE functions are used to construct a composite index ε:
ε = w1·Acc_norm + w2·PREC_norm + w3·REC_norm + w4·F1_norm + w5·RMSE_norm + w6·R²_norm,
where Acc_norm, PREC_norm, REC_norm, F1_norm, RMSE_norm, and R²_norm are the normalized values of Acc, PREC, REC, F1, RMSE, and R², respectively, and w1 to w6 are their weights.
Each index is normalized as X_norm = (X - X_min) / (X_max - X_min), where X is any one of the six indices and X_min and X_max are its minimum and maximum values.
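A sketch of this evaluation using scikit-learn for the standard classification and regression metrics is given below; the macro averaging, the weights w1 to w6, and the minimum and maximum values used for normalization are assumptions.

```python
# Illustrative evaluation of the sound source and sound intensity prediction tasks.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, r2_score, mean_squared_error)

def evaluate(y_src_true, y_src_pred, y_int_true, y_int_pred):
    return {
        "ACC":  accuracy_score(y_src_true, y_src_pred),
        "PREC": precision_score(y_src_true, y_src_pred, average="macro"),
        "REC":  recall_score(y_src_true, y_src_pred, average="macro"),
        "F1":   f1_score(y_src_true, y_src_pred, average="macro"),
        "RMSE": np.sqrt(mean_squared_error(y_int_true, y_int_pred)),
        "R2":   r2_score(y_int_true, y_int_pred),
    }

def normalize(x, x_min, x_max):
    """X_norm = (X - X_min) / (X_max - X_min)."""
    return (x - x_min) / (x_max - x_min)

def composite_index(norm_metrics, weights):
    """epsilon = w1*ACC_norm + ... + w6*R2_norm; the weight values are assumptions."""
    keys = ["ACC", "PREC", "REC", "F1", "RMSE", "R2"]
    return sum(w * norm_metrics[k] for w, k in zip(weights, keys))
```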
By balancing the contributions of the different indices, the composite index ε provides a more comprehensive assessment of the performance of the sound scene prediction.
In conclusion, the sound scene prediction model constructed by the present invention can handle the sound source classification task and the sound intensity prediction task simultaneously, i.e. it predicts the sound source category and the sound intensity at the same time. Using the model to predict the urban sound scene of unknown areas achieves comprehensive and accurate prediction of the urban sound scene.
The invention makes full use of information from different dimensions of the city, including building, road network, and street view data, compensating for the shortcomings of a single data source, so that the feature information of the urban sound scene is captured more comprehensively and the urban sound scene is predicted more efficiently and accurately.
The invention exploits the feature relationship between the sound source category and the sound intensity to establish a shared feature layer and task-specific feature layers, thereby predicting the sound source category and the sound intensity at the same time. After training on a small sample, the model can predict the sound scene of a large urban area from the urban sound scene features alone, without any acoustic data, which improves the overall efficiency of sound scene prediction, as illustrated by the prediction sketch below.
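As a sketch of the sound scene prediction method of the second aspect, the trained model can be applied to the sound scene features of a geographic area to be predicted; the variable names follow the earlier sketches and are assumptions.

```python
# Illustrative prediction for one geographic area to be predicted; `features` is the
# statistically analysed feature dict of that area and `feature_groups` the grouping above.
import torch

model.eval()
with torch.no_grad():
    x_shared = torch.tensor([[features[k] for k in feature_groups["shared"]]], dtype=torch.float32)
    x_int    = torch.tensor([[features[k] for k in feature_groups["intensity"]]], dtype=torch.float32)
    x_src    = torch.tensor([[features[k] for k in feature_groups["source"]]], dtype=torch.float32)
    intensity_pred, source_prob = model(x_shared, x_int, x_src)
    source_label = int(source_prob.argmax(dim=1))     # sound source prediction label
    print(f"predicted sound intensity: {intensity_pred.item():.2f}, sound source class: {source_label}")
```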
This embodiment also provides a training device for a sound scene prediction model fused with multi-source city data. As shown in fig. 4, the training device includes the following components:
a feature statistics module 01, configured to acquire the stored environmental data in the training geographic area and perform statistical analysis on the environmental data to obtain the sound scene training features of the training geographic area;
a prediction module 02, configured to apply the neural network model to the sound scene training features to obtain the sound scene training labels output by the neural network model for the training geographic area;
a training module 03, configured to obtain the measured sound scene label of the training geographic area, calculate the loss function of the neural network model from the measured sound scene label and the sound scene training labels, and train the neural network model according to the loss function to obtain the sound scene prediction model.
Based on the above embodiment, the present invention also provides a terminal device, and a functional block diagram thereof may be shown in fig. 5. The terminal equipment comprises a processor, a memory, a network interface and a display screen which are connected through a system bus. Wherein the processor of the terminal device is adapted to provide computing and control capabilities. The memory of the terminal device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the terminal device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a method of training a sound scene prediction model that incorporates multi-source city data. The display screen of the terminal device may be a liquid crystal display screen or an electronic ink display screen.
It will be appreciated by those skilled in the art that the functional block diagram shown in fig. 5 is merely a block diagram of part of the structure related to the solution of the present invention and does not limit the terminal device to which the solution is applied; a particular terminal device may include more or fewer components than shown, combine some components, or have a different arrangement of components.
In one embodiment, a terminal device is provided. The terminal device includes a memory, a processor, and a sound scene prediction model training program fusing multi-source city data that is stored in the memory and executable on the processor; when the processor executes the sound scene prediction model training program fusing multi-source city data, the following operations are implemented:
acquiring stored environmental data in a training geographic area, and performing statistical analysis on the environmental data to obtain sound scene training features of the training geographic area;
applying a neural network model to the sound scene training features to obtain sound scene training labels output by the neural network model for the training geographic area;
obtaining a measured sound scene label of the training geographic area, calculating a loss function of the neural network model from the measured sound scene label and the sound scene training labels, and training the neural network model according to the loss function to obtain a sound scene prediction model.
Those skilled in the art will appreciate that implementing all or part of the above methods may be accomplished by a computer program stored on a non-transitory computer-readable storage medium which, when executed, may include the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include Read-Only Memory (ROM), Programmable ROM (PROM), Erasable Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), SynchLink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM), among others.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and are not limiting. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.