Traffic Accidents in London City: Perspective of Two Personas

Emre Can Okten
7 min read · Feb 9, 2022

London is home to nearly 9 million people, according to data.gov.uk. Among them are different personas whose daily interactions make up the city’s traffic. This study aims to derive useful, and hopefully actionable, insights from a dataset of all traffic accidents in Greater London for two personas in particular: commuters and health workers.

St Paul’s Cathedral, London, UK. Anthony Delanoix | Unsplash

In the pre-pandemic period, commuters were the group that experienced the most intense hours of transport use in the city, whilst health workers have faced their busiest times amid the COVID-19 pandemic. This is why the analysis focuses on these two groups while proposing areas for improving the health and safety of local people in a broader sense. The questions to be answered by the end of this analysis are below:

For the commuter persona:

What are the riskiest districts and streets for commuters?

Are there identifiable temporal patterns in traffic accident frequency in identified districts and streets?

For the health authorities persona:

Are there any risky areas in terms of health services?

Is it possible to optimise the working hours of health workers in these areas?

Data

The traffic accident data was collected from data.gov.uk, the UK’s official open data platform, and the hospital data was collected from the NHS website datasets. The accident data, which contains information about accidents, vehicles and casualties for the years 2019 and 2020, was downloaded as CSV files and loaded into the tools used for the analysis.

Analysis

After the exploratory analysis was completed, the datasets were cleaned and prepared for analysis. All analytical tasks to prepare the data for visualisation were performed in a Jupyter notebook, which can be accessed from the corresponding GitHub repository. The main Python libraries used are pandas and numpy for data wrangling, scikit-learn (sklearn) for clustering and classification, and plotly for visualisation.

The aggregated data was exported in intermediary steps to perform visualisations in Tableau, one of the three leaders of business intelligence platforms according to Gartner’s Magic Quadrant. Data visualisations are used for analytical reasoning and as potential inputs for the next steps throughout the analysis.

The detailed process flow of the analysis can be seen below and the paper of the study can be accessed here.

The logical flow of the analysis

Results

Characteristics of Traffic Accidents

Spatial Distribution of Accidents

The data was first visualised to observe the spatial distribution of all accidents in the UK between January 2019 and December 2020. Every blue point in the visual corresponds to a single accident. The Greater London area, on which the analysis focuses, is the densest area in terms of accidents compared to the other regions of the UK.

Spatial distribution of traffic accidents in the entire UK for the years 2019 and 2020

Temporal Distribution of Accidents

To review the temporal distribution of the data, a line chart broken down by year was plotted. The trends for 2019 (orange) and 2020 (blue) are very similar, except for the steep decrease in March 2020, which corresponds to the first lockdown of the COVID-19 pandemic.

Temporal distribution of traffic accidents broken down by 2019 and 2020
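The year-by-year comparison above boils down to a month-by-year count table. A minimal pandas sketch of that aggregation follows; the column name `date` and the day-first date format are assumptions made for illustration, not details taken from the original notebook.

```python
import pandas as pd


def monthly_counts_by_year(accidents: pd.DataFrame, date_col: str = "date") -> pd.DataFrame:
    """Return a table of accident counts with months as rows and years as columns."""
    # Assumption: dates are stored as day/month/year strings.
    dates = pd.to_datetime(accidents[date_col], format="%d/%m/%Y")
    return (
        accidents.assign(year=dates.dt.year, month=dates.dt.month)
        .groupby(["month", "year"])
        .size()
        .unstack("year", fill_value=0)
    )


# Toy rows standing in for the real accident records.
sample = pd.DataFrame(
    {"date": ["05/03/2019", "17/03/2019", "02/04/2019", "09/03/2020", "21/04/2020"]}
)
table = monthly_counts_by_year(sample)
print(table)
```

Each column of the resulting table can be drawn as one line of the chart, which makes a drop such as the one in March 2020 easy to spot.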

Commuter Persona

Temporal Distribution of Accidents on Weekdays

After the necessary date-based and time-based filtering was done, the temporal distribution of the accidents was plotted on a bar chart, this time at a more granular level: hours of the day. Bars were colour-coded according to the mean accident severity. There is an observable temporal pattern, with peaks between 7 am and 9 am in the morning and between 4 pm and 6 pm in the afternoon.

Temporal distribution of traffic accidents for weekdays
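The weekday filter and hourly aggregation behind this chart can be sketched as below. The column names `datetime` and `severity` and the `PEAK_HOURS` constant are hypothetical, chosen only for this example; the peak hours themselves follow the 7 am to 9 am and 4 pm to 6 pm windows described above.

```python
import pandas as pd

# Hours whose start falls inside the two peak windows (assumed encoding).
PEAK_HOURS = {7, 8, 16, 17}


def weekday_hourly_profile(
    accidents: pd.DataFrame,
    datetime_col: str = "datetime",
    severity_col: str = "severity",
) -> pd.DataFrame:
    """Count accidents and average their severity per hour, Monday to Friday only."""
    ts = pd.to_datetime(accidents[datetime_col])
    # dayofweek: Monday is 0, Sunday is 6, so values below 5 are weekdays.
    weekdays = accidents.assign(hour=ts.dt.hour)[ts.dt.dayofweek < 5]
    return weekdays.groupby("hour")[severity_col].agg(
        accidents="count", mean_severity="mean"
    )


# Toy rows: three weekday accidents and one Saturday accident.
sample = pd.DataFrame(
    {
        "datetime": [
            "2019-01-07 08:15",  # Monday
            "2019-01-08 08:40",  # Tuesday
            "2019-01-09 17:05",  # Wednesday
            "2019-01-05 08:30",  # Saturday, filtered out
        ],
        "severity": [3, 2, 3, 1],
    }
)
profile = weekday_hourly_profile(sample)
peak_profile = profile[profile.index.isin(PEAK_HOURS)]
print(peak_profile)
```

The `accidents` column gives the bar heights and `mean_severity` the colour, mirroring the encoding of the chart.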

The Riskiest Roads and Areas for Commuters

The data was filtered once more for the peak hours identified in the previous step, and the DBSCAN method was used to create clusters of the riskiest roads and areas for both driver and cyclist commuters. DBSCAN was applied progressively, tuning the minimum samples and epsilon parameters until the clusters took on road-like, linear shapes. In the end, the riskiest roads and areas for commuters were identified. While the riskiest areas for commuters who drive are Lambeth, Enfield and Croydon, the riskiest areas for cyclist commuters are Wandsworth, Lambeth and Southwark.

Spatial distribution of traffic accidents for weekdays
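To illustrate how density-based clustering picks out road-like groups of accidents, here is a minimal scikit-learn sketch. It uses the haversine metric so that epsilon can be expressed in metres; that choice, the function name `cluster_accidents`, and the values of `eps_m` and `min_samples` are assumptions for this example, since the parameters used in the study are not stated here.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, used to convert metres to radians


def cluster_accidents(lat, lon, eps_m=100.0, min_samples=3):
    """Label each accident with a DBSCAN cluster id (-1 means noise)."""
    # The haversine metric expects [latitude, longitude] pairs in radians.
    coords = np.radians(np.column_stack([lat, lon]))
    model = DBSCAN(
        eps=eps_m / EARTH_RADIUS_M,
        min_samples=min_samples,
        metric="haversine",
        algorithm="ball_tree",
    )
    return model.fit_predict(coords)


# Toy data: eight accidents about 45 m apart along one "road", plus one isolated point.
road_lat = 51.5000 + 0.0004 * np.arange(8)
road_lon = np.full(8, -0.1000)
lat = np.append(road_lat, 51.6000)
lon = np.append(road_lon, -0.3000)

labels = cluster_accidents(lat, lon)
print(labels)
```

A small epsilon combined with a low minimum sample count lets a cluster grow along a chain of nearby points, which is what produces elongated, road-shaped clusters rather than round blobs.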

The Riskiest Time Periods for the Riskiest Roads and Areas

The temporal distribution for the peak hours was plotted only for the coloured cluster data points identified in the previous step, for both cyclists and drivers. For cyclists, leaving home and the office early could reduce the risk of an accident, since accident frequency increases towards the end of the peak hours. For drivers, leaving home early could reduce the risk on the way to work, whereas accident frequency is more uniformly distributed on the way home.

Temporal distribution of traffic accidents in identified riskiest areas for the workdays and peak hours only

Health Worker Persona

Spatial Distribution of Districts and Hospitals

The accident data was aggregated by district, and the mean casualty severity was calculated for every district. Every district is represented by a circle whose size corresponds to the mean casualty severity for that district. Every plus sign in the visual represents a hospital. With the assistance of this visual, it is possible to observe the spatial distribution of districts and hospitals in Greater London. However, the view is still cluttered, and it is hard to make any inference about the risks in different regions.

K-means clustering results. Clusters include both data categories: districts as circles and hospitals as plus shapes

Hazard Clusters

To create a more condensed view for identifying risks, districts and hospitals were combined and clustered together using K-means clustering to cover all the data points in the dataset. The elbow-curve method was used to decide the optimum number of clusters, which is 6. The centroids of the 6 clusters were plotted on a map. For the size and the colour intensity of the data points, a new hazard measure was derived using the number of hospitals and the mean casualty severity of the districts included in each cluster. The bigger and darker a data point is, the riskier the region it represents. This visual shows that clusters 2, 4 and 5 represent the regions with the highest risk in terms of health services, i.e. hospital capacity may be inadequate in these regions at certain times.

Centroids of the clusters formed by the combined district and hospital data points, colour and size representing the hazard variable
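The elbow-curve step can be sketched as follows: fit K-means for a range of cluster counts, record the inertia of each fit, and look for the point where adding clusters stops paying off. The data here is synthetic (three artificial groups, so the elbow falls at 3 rather than the 6 found in the study), and the names `inertia_curve` and `is_hospital` are hypothetical. Clustering raw latitude and longitude with Euclidean distance is also a simplification. The hazard measure itself is not reproduced because its exact formula is not given here; only the hospital count per cluster is shown.

```python
import numpy as np
from sklearn.cluster import KMeans


def inertia_curve(points, k_values, seed=0):
    """Return {k: inertia} for K-means fits over the given cluster counts."""
    curve = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(points)
        curve[k] = float(model.inertia_)
    return curve


# Synthetic stand-in for the combined district and hospital coordinates.
rng = np.random.default_rng(0)
centres = np.array([[51.40, -0.30], [51.55, -0.10], [51.50, 0.10]])
points = np.repeat(centres, 20, axis=0) + rng.normal(scale=0.01, size=(60, 2))
# Assumption: every fifth point is a hospital, the rest are district centres.
is_hospital = np.tile([True, False, False, False, False], 12)

curve = inertia_curve(points, range(1, 7))
for k, inertia in curve.items():
    print(k, round(inertia, 4))

# Fit the chosen number of clusters and count hospitals in each one.
final = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
hospitals_per_cluster = np.bincount(final.labels_[is_hospital], minlength=3)
print(hospitals_per_cluster)
```

The centroids in `final.cluster_centers_` are what would be plotted on the map, with the hospital count per cluster as one input to a hazard score.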

Temporal Distribution of the Riskiest Regions

To identify the time periods when hospital capacity may not be adequate, the temporal distribution of the accidents was plotted on a heatmap for the riskiest regions identified in the previous step. This visual suggests that casualty severity tends to be higher between 8 pm and 12 am in the Southwest region, and between 12 am and 4 am in the West region. In the South region, on the other hand, casualty severity is distributed more evenly across all hours of the day. Based on these results, the working hours of health workers in the hospitals of these risky regions can be adjusted accordingly.

The temporal distribution of casualty severity in the riskiest 3 regions in Greater London
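The table behind such a heatmap is a region-by-time-window pivot of mean casualty severity. A minimal pandas sketch is shown below; the column names `region`, `hour` and `severity`, the four-hour window width, and the toy values are all assumptions for illustration.

```python
import pandas as pd


def severity_heatmap_table(
    casualties: pd.DataFrame,
    region_col: str = "region",
    hour_col: str = "hour",
    severity_col: str = "severity",
    bin_hours: int = 4,
) -> pd.DataFrame:
    """Mean severity per region and time window; columns hold each window's start hour."""
    window_start = (casualties[hour_col] // bin_hours) * bin_hours
    return casualties.assign(window_start=window_start).pivot_table(
        index=region_col,
        columns="window_start",
        values=severity_col,
        aggfunc="mean",
    )


# Toy rows standing in for casualties in the three riskiest regions.
sample = pd.DataFrame(
    {
        "region": ["West", "West", "Southwest", "Southwest", "South"],
        "hour": [1, 2, 21, 22, 10],
        "severity": [2, 3, 2, 1, 3],
    }
)
heat = severity_heatmap_table(sample)
print(heat)
```

Cells with no casualties in a window come out as NaN, which a heatmap typically renders as blank. How to read the colour scale depends on how severity is coded in the source data, so that mapping should be checked before drawing conclusions.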

Work / Health / Safety

Work is important, yet what is more important is your health and safety. This analysis revealed some important insights for the commuters to be more aware of the risk status of the roads and the time periods in peak hours, and for the health workers to be mindful of the time periods with a potentially higher risk in terms of traffic accident casualties to more effectively manage the workload in hospitals.

It does not matter whether you are a cyclist crossing London Bridge every morning or a doctor driving to The Royal London Hospital. Whenever we have data about an area of our lives (traffic accidents in this case), it is in the interest of our own safety to look for previous analyses and extracted insights to inform our decisions. This will eventually make us better decision-makers who are more data-driven and better informed about our environment. And one last quote (a little clichéd, but it still holds):

Information is power only if you can take action with it.

Daniel Burrus

Leading Futurist. Strategic Advisor. Disruptive Innovation Expert.

Emre Can Okten

Data Science and Analytics Professional & Technology Enthusiast