Method and system for visual analysis of air quality and resident travel
Technical Field
The invention relates to a big-data-driven method and system for visual analysis of air quality and resident travel.
Background
With the progress of industrialization in China, air pollution from industrial emissions, chiefly sulfur oxides (SOx), nitrogen oxides (NOx), ozone (O3), carbon oxides (COx) and particulate matter (particle size of at most 10 microns or 2.5 microns), has become increasingly serious and greatly affects people's daily travel and life.
With the development of science and technology, data are collected and stored in large quantities and data volumes are growing explosively, so extracting valuable information from the data has become an urgent problem. Faced with large and complex data, traditional data mining and data analysis methods are inadequate for exploring it. Various data analysis and mining methods have been applied to obtain the value contained in the data.
An effective method for solving these problems is therefore needed. In recent years, visual analysis, a science of analytical reasoning based on interactive visual interfaces, has provided a new means for data mining and data analysis. Owing to its interactivity and visual nature, it is popular with researchers and has gradually become a research hotspot.
Visual research on air quality and resident travel is therefore significant for studying the relationship between the two: it can provide an important reference for exploring residents' travel behavior, and it can also draw the attention of relevant departments, such as transportation and medical departments, to air quality. Such research thus has considerable value in both theory and practical application.
Disclosure of Invention
To address the problem of analyzing air quality and resident travel, the invention provides a big-data-driven method and system for visual analysis of air quality and resident travel that helps transportation, medical and other departments analyze both. The visual analysis system helps a user analyze air quality characteristics and resident travel characteristics by displaying an air quality bar chart, a temperature box plot, a POI (point of interest) activity stacked graph and flow graph, a POI activity offset rate calendar heat map and a multidimensional histogram, and thereby explore urban air quality and resident travel. The purpose of the invention is achieved by the following technical scheme: a big-data-driven method for visual analysis of air quality and resident travel, comprising the following steps:
(1) Reconstructing the original air quality data, temperature data, POI data and taxi-hailing difficulty data: the air quality data, temperature data, POI data and taxi-hailing difficulty data are each cleaned and sorted. Cleaning mainly consists of finding and removing abnormal and missing values in each data source; all data are then sorted chronologically by timestamp to facilitate the subsequent visualization of time-series data. The taxi-hailing difficulty data comprise the geographic coordinates and weights of the taxi-hailing difficulty distribution points. The POI data comprise the geographic coordinates of the POI distribution points and the POI types.
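By way of non-limiting illustration, the cleaning and sorting of step (1) could be sketched in Python as follows. The record layout (a dictionary with a `timestamp` key), the function name `clean_and_sort` and the plausible-range test for abnormal values are assumptions made for this sketch; the text does not specify how abnormal values are detected.

```python
from datetime import datetime

def clean_and_sort(records, value_key, low, high):
    """Drop records with missing or out-of-range values, then sort by timestamp."""
    kept = []
    for rec in records:
        value = rec.get(value_key)
        if value is None:               # missing value
            continue
        if not (low <= value <= high):  # abnormal value (assumed range check)
            continue
        kept.append(rec)
    return sorted(kept, key=lambda rec: rec["timestamp"])

# Made-up sample readings, for illustration only.
raw = [
    {"timestamp": datetime(2017, 5, 1, 2), "aqi": 80},
    {"timestamp": datetime(2017, 5, 1, 0), "aqi": None},
    {"timestamp": datetime(2017, 5, 1, 1), "aqi": -5},
    {"timestamp": datetime(2017, 5, 1, 0), "aqi": 65},
]
cleaned = clean_and_sort(raw, "aqi", 0, 500)
```

The same function could be applied to each data source in turn with its own value key and plausible range.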
(2) Calculating the POI weighted activity and the offset rate: the POI weighted activity reflects the flow of people around a POI; the offset rate reflects the change in the POI weighted activity.
The POI weighted activity is calculated as follows:
(2.1) Calculate the Euclidean distance between each taxi-hailing difficulty distribution point and each POI distribution point and judge whether it is smaller than a preset threshold T; if so, take the weight of the taxi-hailing difficulty distribution point as the activity weight of the POI.
(2.2) For each POI type, accumulate the activity of the POIs of that type and take the accumulated sum as the weighted activity of that POI type.
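As a minimal sketch of steps (2.1) and (2.2), assuming planar coordinates in the same units as the threshold T, the weighted activity per POI type might be computed as below. The tuple layouts and the function name `poi_weighted_activity` are hypothetical. Where a POI lies near several taxi-hailing difficulty points, this sketch adds all of their weights, which is an assumption; the text does not say how that case is handled.

```python
import math
from collections import defaultdict

def poi_weighted_activity(difficulty_points, pois, threshold):
    """Sum, per POI type, the weights of nearby taxi-hailing difficulty points.

    difficulty_points: iterable of (x, y, weight)
    pois:              iterable of (x, y, poi_type)
    threshold:         preset distance threshold T, same units as x and y
    """
    activity = defaultdict(float)
    for px, py, poi_type in pois:
        for dx, dy, weight in difficulty_points:
            if math.hypot(px - dx, py - dy) < threshold:  # step (2.1)
                activity[poi_type] += weight               # step (2.2)
    return dict(activity)

# Made-up coordinates and weights, for illustration only.
points = [(0.0, 0.0, 2.0), (10.0, 10.0, 5.0)]
pois = [
    (0.1, 0.0, "restaurant"),
    (0.0, 0.2, "restaurant"),
    (10.0, 10.1, "hospital"),
    (50.0, 50.0, "school"),
]
activity = poi_weighted_activity(points, pois, threshold=0.5)
```

POI types with no difficulty point within T simply do not appear in the result.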
The POI weighted activity offset rate is calculated as follows:
Offset_t = (POIWeight_t - Aver_{week,hour}) / POIWeight_t - 1
where Aver_{week,hour} is the average POI weighted activity for each hour of each week, POIWeight_t is the POI weighted activity in the current hour, and Offset_t is the offset rate.
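For illustration only, the offset rate formula can be transcribed into Python term by term as printed. The function name `offset_rate` and the handling of a zero current activity (returning `None`) are assumptions of this sketch, since the text does not address division by zero.

```python
def offset_rate(poi_weight_t, aver_week_hour):
    """Offset rate as printed: (POIWeight_t - Aver_week_hour) / POIWeight_t - 1."""
    if poi_weight_t == 0:
        return None  # undefined for zero current activity (assumed handling)
    return (poi_weight_t - aver_week_hour) / poi_weight_t - 1

# Made-up values: current-hour activity 4.0 against an average of 2.0.
example = offset_rate(4.0, 2.0)
```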
(3) Clustering POIs of the same type: find all POI distribution points whose Euclidean distance to a taxi-hailing difficulty distribution point is less than or equal to T, and denote them POI_didi. Within POI_didi, calculate the cluster center positions of the POI distribution points of the same type, and assign the weight of the taxi-hailing difficulty distribution point to the cluster center. The POI distribution points are clustered with a k-means-based clustering algorithm, and the newly calculated longitude and latitude of each cluster center are taken as the longitude and latitude of the POI center position.
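The selection and clustering of step 3) could be sketched as follows, as one possible reading rather than the definitive procedure. The sketch uses a plain k-means with the first k points as initial centers (an assumption; the text only says a k-means-based algorithm is used), treats coordinates as planar, and omits the assignment of weights to cluster centers. The names `pois_near` and `kmeans` are hypothetical.

```python
import math

def pois_near(difficulty_points, pois, threshold):
    """Return the POIs within distance T of at least one difficulty point."""
    return [
        (x, y, poi_type)
        for x, y, poi_type in pois
        if any(math.dist((x, y), (dx, dy)) <= threshold
               for dx, dy, _weight in difficulty_points)
    ]

def kmeans(points, k, iterations=20):
    """Plain k-means on (x, y) points; initial centers are the first k points."""
    centers = [points[i] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        new_centers = []
        for i, group in enumerate(groups):
            if group:
                new_centers.append((sum(p[0] for p in group) / len(group),
                                    sum(p[1] for p in group) / len(group)))
            else:
                new_centers.append(centers[i])  # keep the center of an empty cluster
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers

# Made-up coordinates, for illustration only.
difficulty = [(0.0, 1.0, 3.0), (10.0, 11.0, 1.0)]
pois = [(0.0, 0.0, "mall"), (0.0, 2.0, "mall"), (10.0, 10.0, "mall"),
        (10.0, 12.0, "mall"), (50.0, 50.0, "mall"), (0.0, 1.5, "school")]
poi_didi = pois_near(difficulty, pois, threshold=1.5)
malls = [(x, y) for x, y, poi_type in poi_didi if poi_type == "mall"]
centers = kmeans(malls, k=2)
```

Clustering is run separately for each POI type, so each type obtains its own set of center positions.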
(4) Visual analysis of air quality and resident travel, specifically:
(4.1) Color visual coding: because air quality index (AQI) values differ, a dynamic mapping scheme is used when mapping colors, that is, the color is adjusted dynamically according to the air quality index value:
where Color_rect is the fill color of the rectangle.
(4.2) Bar chart-box plot analysis component: the air quality index of each day is shown as a rectangle, and the left-to-right order of the rectangles follows the order of the days; the fill color of each rectangle is determined by the scheme of step (4.1) and its height by the air quality index (AQI). The box plot represents the hourly temperature over the week, arranged from left to right by date and time. The upper and lower dashed lines of each box represent the upper-quarter and lower-quarter data ranges respectively, the small rectangle in the center of the box represents the data range from the first quartile to the third quartile, and the horizontal line inside the small rectangle marks the median of the data.
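The quantities a box plot draws can be sketched with the Python standard library as below. The function name `box_stats` is hypothetical, and letting the two dashed lines run to the minimum and maximum of the data is an assumption consistent with, but not stated by, the description.

```python
import statistics

def box_stats(values):
    """Return the five numbers drawn by a box plot for one group of readings."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "lower": min(values),   # end of the lower dashed line
        "q1": q1,               # bottom edge of the small rectangle
        "median": median,       # horizontal line inside the rectangle
        "q3": q3,               # top edge of the small rectangle
        "upper": max(values),   # end of the upper dashed line
    }

# Made-up temperature readings in degrees Celsius, for illustration only.
stats = box_stats([18, 20, 22, 24, 26])
```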
(4.3) Flow graph-stacked graph analysis component: the abscissa of the stacked graph and the flow graph is the hourly time coordinate of the time range, with the week as the basic scale. The ordinate is the POI weighted activity value. The stacked graph represents different POI types with area graphs of different colors, is laid out on one side of the coordinate axis, and shows how the weighted activity of one or more POI types changes within a specified time range. The flow graph is laid out on both sides of the coordinate axis and likewise shows how the weighted activity of one or more POI types changes within the specified time range.
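The difference between the one-sided and two-sided layouts can be illustrated by computing the lower and upper boundary of each layer. In this sketch the flow graph baseline is placed at minus half the column total so that the layers are symmetric about the axis; that baseline rule is an assumption, as the text only states that the flow graph lies on both sides of the axis. The function names are hypothetical.

```python
def _stack(series, base):
    """Pile the series on top of a starting baseline; return (bottom, top) pairs."""
    layers = []
    for values in series:
        top = [b + v for b, v in zip(base, values)]
        layers.append((base, top))
        base = top
    return layers

def stacked_layers(series):
    """Stacked graph: layers rise from a zero baseline on one side of the axis."""
    return _stack(series, [0.0] * len(series[0]))

def flow_layers(series):
    """Flow graph: baseline at -total/2 so layers spread on both sides of the axis."""
    totals = [sum(column) for column in zip(*series)]
    return _stack(series, [-t / 2 for t in totals])

# Two POI types over two hours; made-up activity values.
series = [[1.0, 2.0], [3.0, 2.0]]
stacked = stacked_layers(series)
flow = flow_layers(series)
```

Each (bottom, top) pair can be passed to an area-drawing routine; the colors assigned to the layers distinguish the POI types.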
(4.4) Scatter matrix-GeoMap-calendar heat map analysis component: the scatter matrix extends the scatter plot to high-dimensional data and is used to display air quality, temperature and POI weighted activity. The calendar heat map presents multidimensional data in two-dimensional form, with the magnitude of a value represented by the shade of color; it shows how the weighted activity offset rate of the same POI changes under different air quality and temperature conditions. The GeoMap displays the activity weights and the geographic distribution of the clusters of POIs of the same type.
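As a rough sketch of how a calendar heat map encodes magnitude as shade, hourly offset rates could be placed in cells keyed by date and hour and normalized to a shade between 0 (lightest) and 1 (darkest). The cell layout, the min-max normalization and the name `calendar_cells` are assumptions of this sketch, not details given in the description.

```python
from datetime import datetime

def calendar_cells(hourly_offsets):
    """Map {timestamp: offset_rate} to {(date, hour): shade in [0, 1]}."""
    values = list(hourly_offsets.values())
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid division by zero when all values are equal
    return {
        (ts.date(), ts.hour): (value - low) / span
        for ts, value in hourly_offsets.items()
    }

# Made-up offset rates, for illustration only.
offsets = {
    datetime(2017, 5, 1, 8): -0.5,
    datetime(2017, 5, 1, 9): 0.0,
    datetime(2017, 5, 2, 8): 0.5,
}
cells = calendar_cells(offsets)
```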
A big-data-driven system for visual analysis of air quality and resident travel comprises the following components:
(1) Bar chart-box plot analysis component: the air quality index of each day is shown as a rectangle, and the left-to-right order of the rectangles follows the order of the days; the height of each rectangle is determined by the air quality index (AQI), and the fill color uses a dynamic mapping scheme, that is, the color is adjusted dynamically according to the air quality index value:
where Color_rect is the fill color of the rectangle.
The box plot represents the hourly temperature over the week, arranged from left to right by date and time. The upper and lower dashed lines of each box represent the upper-quarter and lower-quarter data ranges respectively, the small rectangle in the center of the box represents the data range from the first quartile to the third quartile, and the horizontal line inside the small rectangle marks the median of the data.
(2) Flow graph-stacked graph analysis component: the abscissa of the stacked graph and the flow graph is the hourly time coordinate of the time range, with the week as the basic scale. The ordinate is the POI weighted activity value. The stacked graph represents different POI types with area graphs of different colors, is laid out on one side of the coordinate axis, and shows how the weighted activity of one or more POI types changes within a specified time range. The flow graph is laid out on both sides of the coordinate axis and likewise shows how the weighted activity of one or more POI types changes within the specified time range. The POI weighted activity is calculated as follows:
(2.1) Calculate the Euclidean distance between each taxi-hailing difficulty distribution point and each POI distribution point and judge whether it is smaller than a preset threshold T; if so, take the weight of the taxi-hailing difficulty distribution point as the activity weight of the POI.
(2.2) For each POI type, accumulate the activity of the POIs of that type and take the accumulated sum as the weighted activity of that POI type.
(3) Scatter matrix-GeoMap-calendar heat map analysis component: the scatter matrix extends the scatter plot to high-dimensional data and is used to display air quality, temperature and POI weighted activity. The calendar heat map presents multidimensional data in two-dimensional form, with the magnitude of a value represented by the shade of color; it shows how the weighted activity offset rate of the same POI changes under different air quality and temperature conditions. The GeoMap displays the activity weights and the geographic distribution of the clusters of POIs of the same type.
The activity weights of the clusters of POIs of the same type are calculated as follows: find all POI distribution points whose Euclidean distance to a taxi-hailing difficulty distribution point is less than or equal to T, and denote them POI_didi. Within POI_didi, calculate the cluster center positions of the POI distribution points of the same type, and assign the weight of the taxi-hailing difficulty distribution point to the cluster center. The POI distribution points are clustered with a k-means-based clustering algorithm, and the newly calculated longitude and latitude of each cluster center are taken as the longitude and latitude of the POI center position.
The beneficial effects of the invention are as follows: unlike traditional air quality visualization, the invention visualizes air quality together with resident travel data, so that a user can explore, from global to local and back to global, how air quality changes the activity of different areas of a city, and analyze how residents' travel destinations change under the influence of air quality. The interactive means reduce the cost for analysts of using the system and achieve a good display effect, and the system can show various patterns of air quality and resident travel at four levels: air quality, temperature, POI weighted activity and offset rate.
Drawings
FIG. 1 is the bar chart-box plot analysis component;
FIG. 2 is the flow graph-stacked graph analysis component;
FIG. 3 is the scatter matrix-GeoMap-calendar heat map analysis component;
FIG. 4 is a front-end dependency diagram of the system.
Detailed Description
The following detailed description is made with reference to the embodiments and the accompanying drawings.
The data on which the invention is based are as follows. The air quality data are issued by environmental protection administrative departments at various levels, or by environmental monitoring stations authorized by them, and comprise daily reports and hourly reports. The hourly report has a period of 1 hour: each monitoring station publishes a real-time report on the hour, with indicators including the concentrations of SO2, NO2, O3, CO, PM2.5 and PM10; the daily report gives the 24-hour mean concentrations of SO2, NO2, O3, CO, PM2.5 and PM10 for the day. The atmospheric environment data are issued by meteorological administrative departments at various levels, or by meteorological monitoring stations authorized by them, and likewise comprise daily reports and hourly reports. The hourly report has a period of 1 hour: each monitoring station publishes a real-time report on the hour, with indicators including air pressure, temperature, humidity, precipitation and wind direction; the daily report gives the 24-hour averages of air pressure, temperature, humidity, precipitation and wind direction for the day. The resident travel data are taxi-hailing difficulty data provided by the Didi Cangqiong big data platform, with a period of 1 hour; the taxi-hailing difficulty at different locations is provided on the hour, and each hourly record comprises longitude, latitude and taxi-hailing difficulty. The POI distribution data are detailed POI data comprising the POI address, name, longitude, latitude and type.
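The relationship between the hourly reports and the daily reports described above, where the daily value is the mean of the 24 hourly values, could be sketched as follows. The input layout of (timestamp, value) pairs and the name `daily_means` are assumptions; a plain arithmetic mean is shown, which would not be suitable as written for a circular quantity such as wind direction.

```python
from collections import defaultdict
from datetime import datetime

def daily_means(hourly):
    """Average hourly (timestamp, value) readings into one value per calendar day."""
    buckets = defaultdict(list)
    for ts, value in hourly:
        buckets[ts.date()].append(value)
    return {day: sum(vals) / len(vals) for day, vals in buckets.items()}

# Made-up hourly PM2.5 readings for one day: 10.0, 11.0, ..., 33.0.
hourly = [(datetime(2017, 5, 1, h), 10.0 + h) for h in range(24)]
means = daily_means(hourly)
```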
The invention provides a big-data-driven method for visual analysis of air quality and resident travel, comprising the following steps:
(1) Reconstructing the original air quality data, temperature data, POI data and taxi-hailing difficulty data: the air quality data, temperature data, POI data and taxi-hailing difficulty data are each cleaned and sorted. Cleaning mainly consists of finding and removing abnormal and missing values in each data source; all data are then sorted chronologically by timestamp to facilitate the subsequent visualization of time-series data. The taxi-hailing difficulty data comprise the geographic coordinates and weights of the taxi-hailing difficulty distribution points. The POI data comprise the geographic coordinates of the POI distribution points and the POI types.
(2) Calculating the POI weighted activity and the offset rate: the POI weighted activity reflects the flow of people around a POI; the offset rate reflects the change in the POI weighted activity.
The POI weighted activity is calculated as follows:
(2.1) Calculate the Euclidean distance between each taxi-hailing difficulty distribution point and each POI distribution point and judge whether it is smaller than a preset threshold T, where T may be 0.5 km; if so, take the weight of the taxi-hailing difficulty distribution point as the activity weight of the POI.
(2.2) For each POI type, accumulate the activity of the POIs of that type and take the accumulated sum as the weighted activity of that POI type.
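Because the points are given as longitude and latitude while the example threshold T is given in kilometres, the coordinates have to be brought into a common unit before the Euclidean test of step (2.1). One way to do this, shown purely as an assumption since the text does not describe the conversion, is a local equirectangular projection followed by a planar Euclidean distance. The function names and the Earth radius constant are choices of this sketch.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, assumed for the projection

def planar_distance_km(lon1, lat1, lon2, lat2):
    """Approximate distance in km: project locally to a plane, then Euclidean."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    dx = math.radians(lon2 - lon1) * math.cos(mean_lat) * EARTH_RADIUS_KM
    dy = math.radians(lat2 - lat1) * EARTH_RADIUS_KM
    return math.hypot(dx, dy)

def within_threshold(lon1, lat1, lon2, lat2, t_km=0.5):
    """Step (2.1) test: is the distance smaller than the threshold T?"""
    return planar_distance_km(lon1, lat1, lon2, lat2) < t_km

# Made-up coordinates: two points 0.001 degrees of latitude apart (about 0.11 km).
near = within_threshold(120.0, 30.0, 120.0, 30.001)
far = within_threshold(120.0, 30.0, 120.0, 30.01)
```

At city scale and with a threshold of about half a kilometre, the error of this approximation is small compared with T.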
The POI weighted activity offset rate is calculated as follows:
Offset_t = (POIWeight_t - Aver_{week,hour}) / POIWeight_t - 1
where Aver_{week,hour} is the average POI weighted activity for each hour of each week, POIWeight_t is the POI weighted activity in the current hour, and Offset_t is the offset rate.
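The average term in the formula could be obtained, under one reading of "each hour of each week", by grouping the historical hourly activity by weekday and hour and averaging each group. That reading, the (weekday, hour) key with Monday as 0, and the name `hour_of_week_means` are assumptions of this sketch.

```python
from collections import defaultdict
from datetime import datetime

def hour_of_week_means(hourly_activity):
    """Map (weekday, hour) to the mean activity over all weeks in the data."""
    buckets = defaultdict(list)
    for ts, activity in hourly_activity:
        buckets[(ts.weekday(), ts.hour)].append(activity)
    return {slot: sum(vals) / len(vals) for slot, vals in buckets.items()}

# Made-up activity values; 1 May and 8 May 2017 are both Mondays.
history = [
    (datetime(2017, 5, 1, 8), 4.0),
    (datetime(2017, 5, 8, 8), 6.0),
    (datetime(2017, 5, 1, 9), 3.0),
]
averages = hour_of_week_means(history)
```

The value for the slot matching the current hour would then be supplied as the average term when evaluating the offset rate.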
(3) Clustering POIs of the same type: find all POI distribution points whose Euclidean distance to a taxi-hailing difficulty distribution point is less than or equal to T, and denote them POI_didi. Within POI_didi, calculate the cluster center positions of the POI distribution points of the same type, and assign the weight of the taxi-hailing difficulty distribution point to the cluster center. The POI distribution points are clustered with a k-means-based clustering algorithm, and the newly calculated longitude and latitude of each cluster center are taken as the longitude and latitude of the POI center position.
(4) Visual analysis of air quality and resident travel, specifically:
(4.1) Color visual coding: because air quality index (AQI) values differ, a dynamic mapping scheme is used when mapping colors, that is, the color is adjusted dynamically according to the air quality index value:
where Color_rect is the fill color of the rectangle.
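The mapping formula itself is not reproduced in the text above, so the following Python sketch shows only one hypothetical dynamic mapping and should not be read as the scheme of the invention. It assumes that "dynamic" means the color scale is rescaled to the range of AQI values currently displayed, and it interpolates linearly between two assumed endpoint colors; the function name `aqi_fill_colors` and both RGB endpoints are inventions of this sketch.

```python
def aqi_fill_colors(aqi_values, low_rgb=(0, 200, 0), high_rgb=(200, 0, 0)):
    """Map each AQI value to an RGB fill color relative to the displayed range."""
    low, high = min(aqi_values), max(aqi_values)
    span = (high - low) or 1  # avoid division by zero when all values are equal
    colors = []
    for aqi in aqi_values:
        t = (aqi - low) / span  # 0 at the lowest AQI shown, 1 at the highest
        colors.append(tuple(round(a + (b - a) * t)
                            for a, b in zip(low_rgb, high_rgb)))
    return colors

# Made-up daily AQI values, for illustration only.
colors = aqi_fill_colors([50, 100, 150])
```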
(4.2) Bar chart-box plot analysis component: the air quality index of each day is shown as a rectangle, and the left-to-right order of the rectangles follows the order of the days; the fill color of each rectangle is determined by the scheme of step (4.1) and its height by the air quality index (AQI). The box plot represents the hourly temperature over the week, arranged from left to right by date and time. The dashed lines of each box represent the upper-quarter and lower-quarter data ranges respectively, the small rectangle in the center of the box represents the data range from the first quartile to the third quartile, and the horizontal line inside the small rectangle marks the median of the data, as shown in FIG. 1.
(4.3) Flow graph-stacked graph analysis component: the abscissa of the stacked graph and the flow graph is the hourly time coordinate of the time range, with the week as the basic scale. The ordinate is the POI weighted activity value. The stacked graph represents different POI types with area graphs of different colors, is laid out on one side of the coordinate axis, and shows how the weighted activity of one or more POI types changes within a specified time range. The flow graph is laid out on both sides of the coordinate axis and likewise shows how the weighted activity of one or more POI types changes within the specified time range, as shown in FIG. 2.
(4.4) Scatter matrix-GeoMap-calendar heat map analysis component: the scatter matrix extends the scatter plot to high-dimensional data and is used to display air quality, temperature and POI weighted activity. The calendar heat map presents multidimensional data in two-dimensional form, with the magnitude of a value represented by the shade of color; it shows how the weighted activity offset rate of the same POI changes under different air quality and temperature conditions. The GeoMap displays the activity weights and the geographic distribution of the clusters of POIs of the same type, as shown in FIG. 3.
A big-data-driven system for visual analysis of air quality and resident travel comprises the following components:
(1) Bar chart-box plot analysis component: the air quality index of each day is shown as a rectangle, and the left-to-right order of the rectangles follows the order of the days; the height of each rectangle is determined by the air quality index (AQI), and the fill color uses a dynamic mapping scheme, that is, the color is adjusted dynamically according to the air quality index value:
where Color_rect is the fill color of the rectangle.
The box plot represents the hourly temperature over the week, arranged from left to right by date and time. The dashed lines of each box represent the upper-quarter and lower-quarter data ranges respectively, the small rectangle in the center of the box represents the data range from the first quartile to the third quartile, and the horizontal line inside the small rectangle marks the median of the data, as shown in FIG. 1.
(2) Flow graph-stacked graph analysis component: the abscissa of the stacked graph and the flow graph is the hourly time coordinate of the time range, with the week as the basic scale. The ordinate is the POI weighted activity value. The stacked graph represents different POI types with area graphs of different colors, is laid out on one side of the coordinate axis, and shows how the weighted activity of one or more POI types changes within a specified time range. The flow graph is laid out on both sides of the coordinate axis and likewise shows how the weighted activity of one or more POI types changes within the specified time range, as shown in FIG. 2. The POI weighted activity is calculated as follows:
(2.1) Calculate the Euclidean distance between each taxi-hailing difficulty distribution point and each POI distribution point and judge whether it is smaller than a preset threshold T; if so, take the weight of the taxi-hailing difficulty distribution point as the activity weight of the POI.
(2.2) For each POI type, accumulate the activity of the POIs of that type and take the accumulated sum as the weighted activity of that POI type.
(3) Scatter matrix-GeoMap-calendar heat map analysis component: the scatter matrix extends the scatter plot to high-dimensional data and is used to display air quality, temperature and POI weighted activity. The calendar heat map presents multidimensional data in two-dimensional form, with the magnitude of a value represented by the shade of color; it shows how the weighted activity offset rate of the same POI changes under different air quality and temperature conditions. The GeoMap displays the activity weights and the geographic distribution of the clusters of POIs of the same type, as shown in FIG. 3.
The activity weights of the clusters of POIs of the same type are calculated as follows: find all POI distribution points whose Euclidean distance to a taxi-hailing difficulty distribution point is less than or equal to T, and denote them POI_didi. Within POI_didi, calculate the cluster center positions of the POI distribution points of the same type, and assign the weight of the taxi-hailing difficulty distribution point to the cluster center. The POI distribution points are clustered with a k-means-based clustering algorithm, and the newly calculated longitude and latitude of each cluster center are taken as the longitude and latitude of the POI center position.
In the preprocessing stage of the method, the POI weighted activity is calculated mainly by counting, for each taxi-hailing difficulty point, the accumulated sum over the POIs of each type around it, which yields a measure of POI weighted activity; the POI weighted activity offset rate mainly measures how far the real-time POI activity deviates from the historical mean of the POI weighted activity. By drawing the bar chart-box plot, the stacked graph-flow graph and the scatter matrix-GeoMap-calendar heat map, and through interaction among these views, a user can obtain an important reference for exploring residents' travel behavior; the views can also draw the attention of relevant departments, such as transportation and medical departments, to air quality and offer them constructive suggestions.
While the invention has been described with reference to a single embodiment that shows the various visualization components, the invention is clearly not limited to that embodiment and admits numerous modifications without departing from its basic spirit and scope.