Traffic fatality data sets for Austin, TX.
The traffic fatality reports published by APD are hosted on their website for 2 years. Past this point they are archived and are no longer publicly accessible.
Our automated workflow is composed of the following 4 steps:
- Generate the raw data sets.
- Import/Merge the external data sets.
- Augment the data sets.
- Merge the results into `*-all-*.json` data sets.
Using these techniques to generate the data sets, we ensure that they can always be recreated if a problem occurs.
Each step is detailed in a dedicated section below.
The data sets named `fatalities-{year}-raw.json` are generated directly from ScrAPD without any manual intervention. A data set is created for each year.
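For illustration, a yearly raw data set can be regenerated with a command along these lines; this is a sketch that reuses the ScrAPD flags from the full-update command shown at the end of this document, and the exact invocation may differ:
TOPDIR=$(git rev-parse --show-toplevel); for YEAR in {17..20}; do scrapd -v --format json --from "Jan 1 20${YEAR}" --to "Dec 31 20${YEAR}" > "${TOPDIR}/datasets/fatalities-20${YEAR}-raw.json"; done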
Each of these external data sets has a dedicated documentation file in the `docs` folder.
- Socrata
Here is an example showing how the data is imported:
TOPDIR=$(git rev-parse --show-toplevel); for YEAR in {17..18}; do python ${TOPDIR}/tools/scrapd-importer-fatalities-socrata.py ${TOPDIR}/datasets/fatalities-20${YEAR}-raw.json ${TOPDIR}/external-datasets/socrata-apd-archives/socrata-apd-20${YEAR}.json > ${TOPDIR}/datasets/fatalities-20${YEAR}-augmented.json;done
The data sets named `fatalities-{year}-augmented.json` have been enhanced in order to improve the quality of the data.
# Generate empty data sets.
for f in fatalities-20{13..20}-augmented.json; do echo "[]" > "${f}"; done
Augment the data sets:
TOPDIR=$(git rev-parse --show-toplevel); for f in fatalities-20{13..20}-augmented.json; do python "${TOPDIR}/tools/scrapd-augmenter-geocoding-geocensus.py" -i "${f}"; done
The data sets are augmented with ScrAPD augmenters, which can be found in the `tools` folder. Currently the following augmenters are available:
- scrapd-augmenter-geocoding-geocensus
Corrections can also be added manually; they can add extra fields or update values in existing fields, and they are applied last.
A correction must be made in a file named `augmentation-manual-{year}.json`. All the corrections for the same year MUST be grouped together in the same file. The order does not matter. If an entry appears several times, the last one (i.e. the lowest one in the file) supersedes all the others.
[
  {
    "19-0400694": {
      "Type": "Pedestrian",
      "Gender": "Female"
    },
    "19-0320079": {
      "Type": "Bicycle",
      "Gender": "Unknown"
    }
  }
]
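The "last one wins" rule is consistent with how Python's standard `json` parser handles duplicate keys; since the ScrAPD tools are Python scripts, this is a reasonable assumption rather than a guarantee. When the same key appears several times in an object, only the last value is kept, as this quick illustration shows:
echo '{"19-0400694": {"Gender": "Male"}, "19-0400694": {"Gender": "Female"}}' | python -c 'import json, sys; print(json.load(sys.stdin))'
This prints {'19-0400694': {'Gender': 'Female'}}: only the last entry survives.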
Apply the changes for a specific year:
python tools/scrapd-merger.py -i datasets/fatalities-2020-augmented.json augmentations/2020/augmentation-manual-2020.json
The data sets whose year is `all` are a combination of all the data sets of the same category. They are generated using the following `jq` commands:
jq -s add fatalities-20{17..20}-raw.json > fatalities-all-raw.json
jq -s add fatalities-20{17..20}-augmented.json > fatalities-all-augmented.json
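The `-s` flag slurps all the input files into a single array of arrays, and `add` concatenates them into one flat array. A minimal illustration with hypothetical inputs:
echo '[{"Case": "19-1"}]' > a.json; echo '[{"Case": "19-2"}]' > b.json; jq -s add a.json b.json
This outputs a single array containing both entries.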
A full update happens when ScrAPD gets updated with changes that drastically improve the quality of the retrieved data.
ScrAPD versions which required a full update:
- 1.5.0
- 1.5.1
As a result, all the data sets were updated with the following command:
for i in {17..20}; do python tools/scrapd-merger.py -i fatalities-20${i}-raw.json <(scrapd -v --format json --from "Jan 1 20${i}" --to "Dec 31 20${i}"); done