This article details the process of collecting crime incident data for England from 2011 to 2022 into a single tabular file.
A crime incident is defined here as a crime that occurred in England and was recorded at a chosen geographical level, in this case the lower layer super output area (LSOA). Every crime incident falls within an LSOA, and the corresponding LSOA code and name are provided with each record. LSOAs can be understood as one of the types of constituent area that make up England. They were produced by the Office for National Statistics for the reporting of small area statistics, and England currently has 32,844 of them. Table 1 below describes the constituent area types in England.
Table 1: Description of area types in England. Adapted from LSOAs
Area Type | Description |
---|---|
Output Area (OA) | Has an average population of about 310 residents and is the smallest geography for which data is published. Apart from Census outputs, not much data is available at this level |
Lower Layer Super Output Area (LSOA) | Has an average population of 1500 people or 650 households. Much more data is available at this level |
Middle Layer Super Output Area (MSOA) | Has an average population of 7500 residents or 4000 households |
OAs rest within the boundaries of LSOAs, LSOAs rest within the boundaries of MSOAs, and MSOAs rest within the boundaries of local authorities. Figure 1 below illustrates how these geographies relate to one another.
Figure 1: Illustration of Area types in England. Adapted from LSOAs
The types of crime incident and their descriptions are listed in Table 2 below.
Table 2: Description of type of crime incidents. Adapted from Police.UK
Type of crime incidents | Description |
---|---|
Anti-social behaviour | Personal, environmental and nuisance anti-social behaviour |
Bicycle theft | Taking without consent or theft of a pedal cycle |
Burglary | Offences where a person enters a house or other building with the intention of stealing |
Criminal damage and arson | Damage to buildings and vehicles and deliberate damage by fire |
Drugs | Offences related to possession, supply and production |
Other crime | Forgery, perjury and other miscellaneous crime |
Other theft | Theft by an employee, blackmail and making off without payment |
Possession of weapons | Possession of a weapon, such as a firearm or knife |
Public order | Offences which cause fear, alarm or distress |
Robbery | Offences where a person uses force or threat of force to steal |
Shoplifting | Theft from shops or stalls |
Theft from the person | Crimes that involve theft directly from the victim (including handbag, wallet, cash, mobile phones) but without the use or threat of physical force |
Vehicle crime | Theft from or of a vehicle or interference with a vehicle |
Violence and sexual offences | Offences against the person such as common assaults, Grievous Bodily Harm and sexual offences |
Some of the crime types may seem to mean the same thing, but there are clear distinctions between them, chiefly among theft, burglary, and robbery. Although these terms are often used interchangeably in day-to-day life, care must be taken to distinguish them here, because they can carry different levels of severity depending on the case.
Because the dataset is very large, data.police.uk offers it in chunks as compressed files. An example is shown in Figure 2 below.
Figure 2: Example of chunks of crime data in compressed format. Adapted from data.police.uk
To cover this project's required period of 2011 to 2022, three compressed files need to be downloaded:
- Crime data from October-2019 to September-2022. Approximately 1.52GB in size.
- Crime data from October-2016 to September-2019. Approximately 1.57GB in size.
- Crime data from December-2010 to September-2016. Approximately 2.09GB in size.
The December 2010 crime data is deleted manually after downloading, as it falls outside this project's time frame. The folder and file structures after extracting the compressed files are shown in Figure 3 and Figure 4 respectively.
Figure 3: Folder structure after extracting compressed crime data
Figure 4: File structure inside the folders
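As an alternative to deleting the December 2010 folder by hand, the unwanted month can be skipped at extraction time. The following is only a minimal sketch: it assumes the downloads are `.zip` archives whose members are laid out as `YYYY-MM/<file>.csv`, and the function name `extract_archive` and its parameters are hypothetical, not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir, skip_months=("2010-12",)):
    """Extract an archive, skipping any month folder listed in skip_months.

    Assumes members look like "2011-01/2011-01-<area>-street.csv".
    Returns the list of member names that were extracted.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            # The first path component is the year-month folder.
            month_folder = member.split("/", 1)[0]
            if month_folder in skip_months:
                continue
            archive.extract(member, dest)
            extracted.append(member)
    return extracted
```

Skipping the folder during extraction avoids writing files to disk that would only be deleted afterwards.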
Figure 3 shows that the folders are organized by year and month. Figure 4 shows that the file names are built from the year, month, police force name and crime file type. The crime file type is one of outcomes, stop-and-search, or street. For the purposes of this project, only files of type street are included. Figure 5 shows a snippet of the unprocessed crime incident data for the City of London in January 2011.
Figure 5: Example of unprocessed crime incident data
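The naming pattern of the extracted files can be taken apart programmatically, which makes it easy to keep only the street files. The sketch below assumes file names of the form `YYYY-MM-<area>-<type>.csv` (for example `2011-01-city-of-london-street.csv`); the helper names `parse_crime_filename` and `is_street_file` are hypothetical and used for illustration only.

```python
from pathlib import Path

# The three crime file types described in the text.
FILE_TYPES = ("outcomes", "stop-and-search", "street")


def parse_crime_filename(path):
    """Split a file name into (month, area label, file type).

    Assumes names like "2011-01-city-of-london-street.csv".
    """
    stem = Path(path).stem
    # First 7 characters are "YYYY-MM"; character 8 is the separating hyphen.
    month, rest = stem[:7], stem[8:]
    for file_type in FILE_TYPES:
        suffix = "-" + file_type
        if rest.endswith(suffix):
            return month, rest[: -len(suffix)], file_type
    raise ValueError(f"Unrecognised crime file name: {path}")


def is_street_file(path):
    """Return True only for files of type "street"."""
    return parse_crime_filename(path)[2] == "street"
```

Matching on the known type suffixes, rather than splitting on every hyphen, matters because both the area label and the `stop-and-search` type contain hyphens themselves.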
The basic steps taken to combine all the required crime data into a single file are as follows:
- Iterate through the three main parent folders, “2011-2016”, “2017-2019”, and “2020-2022”.
- Within each parent folder, access the crime data by year and month.
- Iterate through the crime files and select only those of type “street”.
- Read each “street” file into a Pandas DataFrame and append it to a Python list.
- Continue appending the DataFrame for each subsequent month to the same list.
- Repeat this for every month from 2011 to 2022, then merge the list into a single Pandas DataFrame, which can be exported in a tabular format.
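The steps above can be sketched roughly as follows. This is not the notebook's actual code, only an illustration under stated assumptions: each parent folder contains one sub-folder per month, street files end in `-street.csv`, and the function name `combine_street_files` and the output file name are hypothetical.

```python
from pathlib import Path

import pandas as pd


def combine_street_files(parent_dirs, columns=None):
    """Read every "*-street.csv" under the parent folders into one DataFrame.

    Assumes a layout of <parent>/<YYYY-MM>/<YYYY-MM>-<area>-street.csv.
    `columns` optionally restricts which columns are read.
    """
    frames = []
    for parent in parent_dirs:
        # One level of month folders, then only the street files inside them.
        for csv_path in sorted(Path(parent).glob("*/*-street.csv")):
            frames.append(pd.read_csv(csv_path, usecols=columns))
    if not frames:
        raise FileNotFoundError("No street files found under the given folders")
    # Merge the list of monthly DataFrames into a single DataFrame.
    return pd.concat(frames, ignore_index=True)


# Example usage (folder names as in the text, output name is illustrative):
# combined = combine_street_files(["2011-2016", "2017-2019", "2020-2022"])
# combined.to_csv("crime_2011_2022.csv", index=False)
```

Collecting the DataFrames in a list and calling `pd.concat` once at the end is generally much cheaper than concatenating inside the loop, since each concatenation copies the data accumulated so far.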
The combined file has 42 million rows and 10 columns, and is around 7.5GB in size. The columns are described in Table 3 below, and a snippet of the combined crime incident data is shown in Figure 6.
Table 3: Description of columns in the combined crime incident data
Column | Description |
---|---|
Month | Month of the crime incident. Data type: string |
Reported by | The police force name that provided the data about the crime. Data type: string |
Falls within | At present, also the police force name that provided the data about the crime. Data type: string |
Longitude | Anonymised longitude coordinate of the crime incident. Data type: float |
Latitude | Anonymised latitude coordinate of the crime incident. Data type: float |
Location | Description of the crime location. Data type: string |
LSOA code | References the LSOA code that the anonymised coordinate point falls into. Data type: string |
LSOA name | References the LSOA name that the anonymised coordinate point falls into. Data type: string |
Crime type | Category of the crime, one of the types listed in Table 2. Data type: string |
Last outcome category | Reference to whichever of the outcomes associated with the crime occurred most recently. Data type: string |
Figure 6: Snippet of Combined Crime Incident Data
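Because the combined file is large, a quick sanity check is easier to run in chunks than by loading everything into memory at once. The sketch below declares the data types from Table 3 and counts rows per month, which helps confirm that every month in the intended period is present. The function name `rows_per_month` and the default chunk size are assumptions made for illustration.

```python
import pandas as pd

# Column data types as described in Table 3.
COLUMN_DTYPES = {
    "Month": "string",
    "Reported by": "string",
    "Falls within": "string",
    "Longitude": "float64",
    "Latitude": "float64",
    "Location": "string",
    "LSOA code": "string",
    "LSOA name": "string",
    "Crime type": "string",
    "Last outcome category": "string",
}


def rows_per_month(csv_path, chunksize=1_000_000):
    """Count rows per value of the "Month" column, reading the file in chunks."""
    totals = {}
    reader = pd.read_csv(csv_path, dtype=COLUMN_DTYPES, chunksize=chunksize)
    for chunk in reader:
        # Accumulate this chunk's per-month counts into the running totals.
        for month, count in chunk["Month"].value_counts().items():
            totals[month] = totals.get(month, 0) + int(count)
    return pd.Series(totals, dtype="int64").sort_index()
```

Declaring the data types up front also avoids pandas having to infer them for each chunk separately.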
The Python notebook for the data collection described in this article is hosted in my GitHub repository.