The NCCU has large amounts of text data acquired from a range of sources that they analyse to identify cyber-criminals. A key focus of this work is identifying features about the criminals that can be passed to intelligence officers, including the nationality and geographic location of the actor. The NCCU wanted to understand the feasibility of developing and scaling a machine learning approach to identify the nationality of online actors from their written English.
We began the project by collecting user stories and assessing the client's data and systems. We also performed background research related to the problem, including a review of relevant recent NLP literature, to help upskill the customer. Following this initial phase we adopted an agile approach, using what we had learned to identify and agree the two packages of work that would benefit the client most:
- The first involved representing the text by calculating the frequency of various linguistic features. We then used these representations to train machine learning models to produce a native/non-native prediction.
- The second approach used the raw text to train an artificial neural network. Rather than explicitly calculating features, the model was set up to learn the best features to use for itself.
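As an illustration of the first approach only, the sketch below turns each text into relative frequencies of a few function words and classifies it by nearest centroid. It is a minimal sketch under stated assumptions, not the project's implementation: the feature list, the nearest-centroid classifier, the function names and the toy sentences are all hypothetical, chosen to keep the example dependency-free.

```python
import math
import re
from collections import Counter

# Hypothetical feature set: a handful of English function words.
# The linguistic features used in the project are not listed in this write-up.
FUNCTION_WORDS = ["the", "a", "an", "of", "in", "on", "to", "is", "have", "has"]


def featurise(text):
    """Return the relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = max(len(tokens), 1)  # avoid division by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def train(samples):
    """Build one centroid per label from (text, label) pairs."""
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append(featurise(text))
    return {label: centroid(vectors) for label, vectors in by_label.items()}


def predict(model, text):
    """Return the label whose centroid is closest (Euclidean) to the text."""
    vector = featurise(text)
    return min(model, key=lambda label: math.dist(vector, model[label]))


# Invented toy data, for illustration only.
TOY_SAMPLES = [
    ("The man went to the shop in the morning", "native"),
    ("She has a house in the city", "native"),
    ("Man went shop morning", "non-native"),
    ("She have house in city", "non-native"),
]

model = train(TOY_SAMPLES)
print(predict(model, "The dog is in the garden"))
print(predict(model, "Dog go garden"))
```

In practice the feature set would be far richer and the classifier would be trained and evaluated on a large labelled corpus. The second approach replaces the hand-written `featurise` step with representations learned by the network itself.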
The primary benefit to the client was an increased capability to identify cyber-criminals, the main function of the organisation. We delivered this by providing a range of approaches for identifying UK actors online that will support the intelligence teams by enabling high throughput analysis of text data.
Drawing insight from text data in an automated manner, and on such a large scale, requires specialised skills in data science, natural language processing and machine learning. Our consultants brought this expertise and also worked with the customer to transfer knowledge, supporting them in the long term as they grow their internal capability.