Traffic Accidents in London City: Perspective of Two Personas
There are nearly 9 million people living in London according to data.gov.uk. Among these 9 million, there are different personas who make up the daily traffic interactions in the city. This study aims to derive useful, and hopefully actionable, insights from a dataset of all traffic accidents in Greater London for two personas in particular: commuters and health workers.
There is no doubt that in the pre-pandemic period, the commuters were the ones who experienced the most intense hours in the city in their use of transportation, whilst the health workers have faced the busiest times amid the COVID-19 pandemic. This is why this analysis focuses on these two groups of people while proposing areas to improve the health and safety conditions of local people in a broader sense. The questions to be answered at the end of this analysis are below:
For the commuter persona:
What are the riskiest districts and streets for commuters?
Are there identifiable temporal patterns in traffic accident frequency in identified districts and streets?
For the health worker persona:
Are there any risky areas in terms of health services?
Is it possible to optimise the working hours of health workers in these areas?
Data
The traffic accident data were collected from data.gov.uk, the official open data platform of the UK, and the hospital data were collected from the NHS website datasets. The traffic accident data, which contain information about accidents, vehicles and casualties for 2019 and 2020, were downloaded as CSV files and loaded into the tools used for the analysis.
Analysis
The datasets were cleaned and prepared for analysis after the exploratory analysis was completed. All analytical tasks to prepare the data for visualisation were performed in a Jupyter notebook, which can be accessed from the corresponding GitHub repository. The main Python libraries used are pandas and numpy for data wrangling, scikit-learn (sklearn) for clustering and classification, and plotly for visualisation.
The aggregated data was exported in intermediary steps to perform visualisations in Tableau, one of the three leaders of business intelligence platforms according to Gartner’s Magic Quadrant. Data visualisations are used for analytical reasoning and as potential inputs for the next steps throughout the analysis.
The detailed process flow of the analysis can be seen below and the paper of the study can be accessed here.
Results
Characteristics of Traffic Accidents
Spatial Distribution of Accidents
The data was first visualised to observe the spatial distribution of all the accidents in the UK between January 2019 and December 2020. Every blue point in the visual corresponds to a single accident. The Greater London area, on which the analysis focuses, has the highest density of accidents of any region in the UK.
Temporal Distribution of Accidents
To review the temporal distribution of the data, a line chart with the year breakdown was plotted. It is obvious that the trends of both years 2019 (orange) and 2020 (blue) are very similar except for the steep decrease in March 2020, which corresponds to the first lockdown of the COVID-19 pandemic.
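The year-by-year comparison can be prepared with a simple aggregation before plotting. The sketch below is only an illustration: the `date` column name, the sample rows and the `monthly_counts_by_year` helper are assumptions, not taken from the original notebook.

```python
import pandas as pd

# Hypothetical sample rows; the real input would be the accident CSV files.
accidents = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-03-04", "2019-03-18", "2019-04-02",
        "2020-03-05", "2020-04-07", "2020-04-21",
    ]),
})


def monthly_counts_by_year(df, date_col="date"):
    """Return a table of accident counts with one row per month and one column per year."""
    return (
        df.assign(year=df[date_col].dt.year, month=df[date_col].dt.month)
          .groupby(["month", "year"])
          .size()
          .unstack("year", fill_value=0)
    )


monthly = monthly_counts_by_year(accidents)
print(monthly)
```

Each column of `monthly` can then be drawn as one line, so a drop in a given month of one year stands out against the other year.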
Commuter Persona
Temporal Distribution of Accidents on Weekdays
After the necessary date-based and time-based filtering was done, the temporal distribution of the accidents was plotted on a bar chart, this time at a more granular level: hours of the day. Bars were colour-coded according to the mean accident severity. There are two observable peaks: between 7 am and 9 am in the morning, and between 4 pm and 6 pm in the afternoon.
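As a rough illustration of this step, the weekday filter and the hourly aggregation might look like the following. The column names (`datetime`, `severity`), the sample rows and the `hourly_profile` helper are assumptions made for the example and may differ from the original notebook.

```python
import pandas as pd

# Hypothetical records; column names are assumptions for this sketch.
accidents = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2019-01-07 08:15",  # Monday
        "2019-01-07 08:40",  # Monday
        "2019-01-08 17:05",  # Tuesday
        "2019-01-12 08:30",  # Saturday, dropped by the weekday filter
    ]),
    "severity": [3, 2, 3, 1],
})


def hourly_profile(df):
    """Count weekday accidents per hour and compute the mean severity per hour."""
    # dayofweek: Monday is 0 and Sunday is 6, so values below 5 are weekdays.
    weekdays = df[df["datetime"].dt.dayofweek < 5]
    return (
        weekdays.groupby(weekdays["datetime"].dt.hour)["severity"]
        .agg(accidents="count", mean_severity="mean")
        .rename_axis("hour")
    )


profile = hourly_profile(accidents)
print(profile)
```

The `accidents` column would drive the bar heights and `mean_severity` the bar colours.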
The Riskiest Roads and Areas for Commuters
The data was filtered once more for the peak hours identified in the previous step, and the DBSCAN method was used to cluster the accidents for both driver and cyclist commuters. The minimum samples and epsilon parameters were tuned progressively so that the resulting clusters were road-like and linear in shape. In the end, the riskiest roads and areas for commuters were identified. While the riskiest areas for commuters who drive are Lambeth, Enfield and Croydon, the riskiest areas for cyclist commuters are Wandsworth, Lambeth and Southwark.
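To make the clustering step more concrete, here is a minimal sketch of DBSCAN on accident coordinates using scikit-learn. The haversine metric, the `eps_km` and `min_samples` values, the sample points and the `cluster_accidents` helper are all assumptions chosen for illustration; the parameters actually used in the study are not stated here.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0088


def cluster_accidents(lat_lon_deg, eps_km=0.2, min_samples=3):
    """Cluster (latitude, longitude) points given in degrees; label -1 marks noise."""
    # The haversine metric expects [lat, lon] in radians and measures
    # distance in radians, so the radius in km is divided by Earth's radius.
    coords = np.radians(np.asarray(lat_lon_deg, dtype=float))
    model = DBSCAN(
        eps=eps_km / EARTH_RADIUS_KM,
        min_samples=min_samples,
        metric="haversine",
        algorithm="ball_tree",
    )
    return model.fit_predict(coords)


# Hypothetical points: four accidents strung along one road, plus one far away.
points = [
    (51.5000, -0.1200),
    (51.5005, -0.1200),
    (51.5010, -0.1200),
    (51.5015, -0.1200),
    (51.6000, -0.3000),
]
labels = cluster_accidents(points)
print(labels)
```

A small epsilon combined with a low minimum sample count tends to chain nearby points together, which is one way to end up with elongated, road-shaped clusters rather than round blobs.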
The Riskiest Time Periods for the Riskiest Roads and Areas
The temporal distribution for the peak hours was plotted only for the coloured cluster data items identified in the previous step, for both cyclists and drivers. For cyclists, leaving home and the office early could reduce the risk of an accident, since the accident frequency increases towards the end of the peak hours. For drivers, leaving home early could reduce the risk on the way to work, but the accident frequency is more uniformly distributed on the way home.
Health Worker Persona
Spatial Distribution of Districts and Hospitals
The accident data was aggregated by district and the mean casualty severity was calculated for every district. Every district is represented by a circle, the size of which corresponds to the mean casualty severity for that district. Every plus sign in the visual represents a hospital. This visual shows the spatial distribution of districts and hospitals in Greater London. However, the view is still cluttered and it is hard to make any inference about the risks in different regions.
Hazard Clusters
To create a more condensed view for identifying risks, districts and hospitals were combined and clustered together using K-means clustering to cover all the data points in the dataset. The elbow-curve method was used to decide the optimum number of clusters, which is 6. The centroids of the 6 clusters were plotted on a map. For the size and the colour intensity of the data points, a new hazard measure was derived using the number of hospitals and the mean casualty severity of the districts included in each cluster. The bigger and darker a data point is, the riskier the region it represents. This visual shows that clusters 2, 4 and 5 represent the regions with the highest risk in terms of health services, i.e. hospital capacity may be inadequate in these regions at certain times.
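The elbow-curve step can be sketched as follows. This is only an illustration on synthetic points: the three hypothetical centres, the `elbow_inertias` helper and the use of plain Euclidean distance on latitude and longitude are assumptions, and the hazard measure itself is not reproduced because its formula is not given here.

```python
import numpy as np
from sklearn.cluster import KMeans


def elbow_inertias(points, k_values, seed=0):
    """Fit K-means for each k and return the within-cluster sum of squares."""
    inertias = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(points)
        inertias[k] = float(model.inertia_)
    return inertias


# Synthetic (latitude, longitude) points around three hypothetical centres.
rng = np.random.default_rng(0)
centres = np.array([[51.40, -0.30], [51.50, -0.10], [51.60, 0.10]])
points = np.vstack([c + rng.normal(scale=0.005, size=(30, 2)) for c in centres])

inertias = elbow_inertias(points, range(1, 7))
for k, value in inertias.items():
    print(k, round(value, 4))
```

Plotting these values against k gives the elbow curve; the number of clusters is usually read off where the curve stops dropping sharply.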
Temporal Distribution of The Riskiest Regions
To identify the time periods when the hospital capacity may not be adequate, the temporal distribution of the accidents was plotted on a heatmap for the riskiest regions identified in the previous step. This visual suggests that casualty severity tends to be higher between 8 pm and 12 am in the Southwest region, and between 12 am and 4 am in the West region. In the South region, on the other hand, casualty severity is distributed more evenly across all hours of the day. Based on these results, the working hours of the health workers in the hospitals in these risky regions can be adjusted accordingly.
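The table behind such a heatmap could be built along these lines. The four-hour blocks match the time ranges quoted above, but the column names, the sample values and the `severity_heatmap_table` helper are assumptions for the sake of the example.

```python
import pandas as pd

# Hypothetical casualty records for the three risky regions.
casualties = pd.DataFrame({
    "region": ["Southwest", "Southwest", "West", "West", "South"],
    "hour": [21, 22, 1, 2, 13],
    "casualty_severity": [2, 3, 1, 2, 3],
})


def severity_heatmap_table(df, block_hours=4):
    """Return mean casualty severity per region (rows) and time block (columns)."""
    # Map each hour to the start of its block, e.g. 21 -> 20 for 4-hour blocks.
    block_start = (df["hour"] // block_hours) * block_hours
    return df.assign(block_start=block_start).pivot_table(
        index="region",
        columns="block_start",
        values="casualty_severity",
        aggfunc="mean",
    )


table = severity_heatmap_table(casualties)
print(table)
```

Cells with no records come out as NaN, which most heatmap tools render as blank, so gaps in the data remain visible rather than being mistaken for low severity.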
Work / Health / Safety
Work is important, yet what is more important is your health and safety. This analysis revealed some important insights for the commuters to be more aware of the risk status of the roads and the time periods in peak hours, and for the health workers to be mindful of the time periods with a potentially higher risk in terms of traffic accident casualties to more effectively manage the workload in hospitals.
It does not matter whether you are a cyclist crossing London Bridge every morning or a doctor driving to The Royal London Hospital. For any area where data is available (traffic accidents in our case), it is in the interest of our own safety to look for previous analyses and extracted insights to inform our decisions. This will eventually make us better decision-makers who are more data-driven and informed about our environment. And a closing quote (a little clichéd, but it still holds):
Information is power only if you can take action with it.
Daniel Burrus
Leading Futurist. Strategic Advisor. Disruptive Innovation Expert.