Editor’s Note: Learn more about the technical details of this article at the talk “Deep Learning for Third-Party Risk Identification and Evaluation at Dow Jones” at ODSC Europe 2019

For more than 17 years, Dow Jones has supplied risk and compliance data to banking and financial institutions, corporations and governments across the world. By sharing defined and structured content sets of people and entities, we enable customers to manage third-party risk: anti-money laundering, anti-bribery, corruption, sanctions and reputational risk. In order to achieve comprehensive coverage guided by international regulation and guidance since 2002, we follow very high editorial standards and research methodologies to manage 30 risk categories in over 70 languages, 24 hours a day.

We wanted to apply natural language processing (NLP) techniques to the risk and compliance data research process for risk profile creation and management. Our objectives include: eliminating low-level, repeatable and manual processes; pulling actionable insights out of unstructured data; gaining intelligence from global media and research tools; scanning and monitoring almost 2 million articles per week; and achieving near-real-time risk data detection and delivery capabilities.

And of course, to help make better decisions, faster.


Dow Jones owns Factiva, an archive of premium news articles that spans more than 30 years and is growing all the time. These articles serve as data points that can inform evolving industry demand in portfolio management, sales, business development, risk target identification and aggregation of deal opportunities, amongst others. This archive could be an invaluable dataset for ML-driven research and for information extraction that obtains semantically well-defined data from a chosen target domain, interpreted with respect to category and context.


Our approach is founded on the assumption that a set of long, unstructured article texts contains a complex network of one or more entities, events, and relations that can be automatically processed and structured to expand our knowledge of the world and prompt action. Our aim is to develop a system that extracts this sort of information from raw texts. To formulate the problem a different way: we know that there is a lot of useful information buried in news texts, so how can we tap this wealth of unstructured data and reveal hidden knowledge using machine learning, and information extraction techniques in particular? Accordingly, we treat this as a structure prediction task.

Knowledge extraction

Criminal liability and fines, disgorgement of profits, forfeiture of assets, and loss in share value: these are the consequences companies face for failing to prevent illegal or unethical activity by third parties with which they work. Imagine that your job is to ensure that there is nothing in a third party's corporate history, organizational structure, ownership, or operations, or in the records of its senior management, directors, and owners, that could indicate such activity. You need to be confident in your understanding of where third-party risk lies and effective in focusing efforts on the highest risks.

Risk Identification and Evaluation

For instance, identifying a sanctioned entity or person is a straightforward task, since such data is initially provided in a structured format. The impact of sanctions, however, goes far beyond the target itself and extends to its relationships as well. For example, OFAC sanctions mean that a subsidiary of the sanctions target will be blocked from making bank payments, which might have further impact down the chain.
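One way to picture this knock-on effect is as a walk over a relationship graph. The sketch below is only an illustration under stated assumptions: the ownership graph, the company names, and the `exposed_entities` helper are all hypothetical, and a breadth-first search is used here simply because it reports how many steps each entity sits from the sanctions target. It is not a description of the Dow Jones system.

```python
from collections import deque

# Hypothetical ownership graph: parent -> direct subsidiaries.
# All company names are invented for illustration.
OWNERSHIP = {
    "Acme Holdings": ["Acme Shipping", "Acme Metals"],
    "Acme Shipping": ["Acme Port Services"],
    "Acme Metals": [],
    "Acme Port Services": [],
}


def exposed_entities(graph, sanctioned):
    """Map each entity below the sanctioned target to its distance in hops."""
    distances = {}
    seen = {sanctioned}
    queue = deque([(sanctioned, 0)])
    while queue:
        entity, depth = queue.popleft()
        for child in graph.get(entity, []):
            if child not in seen:  # guard against cycles in ownership data
                seen.add(child)
                distances[child] = depth + 1
                queue.append((child, depth + 1))
    return distances


exposure = exposed_entities(OWNERSHIP, "Acme Holdings")
for name, hops in exposure.items():
    print(f"{name}: {hops} hop(s) from the sanctions target")
```

The same traversal would apply to supplier or client relationships, assuming those links had already been extracted into a comparable structure.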

To that end, an NLP-driven structure prediction solution can help scan sanctions-related news articles, with each article presumed to be based upon one or more sanction events. Based on the assumption we made, each of our information extraction tasks has a template, which is a set of case semantic frames that hold the information contained in a single document text (article). In our simplified sample case, a sanction event may involve the name of the regulator that sanctioned a company, the sanctioned company itself, the company’s country of operation, its ownership and senior management members, direct and indirect subsidiaries, suppliers and clients, and the date on which the event happened. To automatically “understand” this set of articles, we can generalize the problem as one of finding the data that corresponds to the slots in this template.
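The template described above can be pictured as a record with one field per slot. The following is a minimal sketch, assuming a Python dataclass as the container; the class name `SanctionEvent`, the field names, and the example values are hypothetical choices made for illustration, not the schema used in production.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SanctionEvent:
    """Hypothetical template: one slot per item listed in the text."""
    regulator: Optional[str] = None
    sanctioned_company: Optional[str] = None
    country_of_operation: Optional[str] = None
    owners: List[str] = field(default_factory=list)
    senior_management: List[str] = field(default_factory=list)
    subsidiaries: List[str] = field(default_factory=list)
    suppliers: List[str] = field(default_factory=list)
    clients: List[str] = field(default_factory=list)
    event_date: Optional[str] = None

    def unfilled_slots(self):
        """Names of slots the extractor has not populated yet."""
        return [name for name, value in vars(self).items() if value in (None, [])]


# Invented example: a partially filled template for one article.
event = SanctionEvent(
    regulator="OFAC",
    sanctioned_company="Example Trading Ltd",
    subsidiaries=["Example Logistics"],
)
print(event.unfilled_slots())
```

Framing extraction as slot filling makes the gaps explicit: any slot still empty after an article is processed is a candidate for further research or for a second source.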

For you, this means an opportunity to identify potential disruptions to the relationship chain once sanctions are introduced, and to set a back-up plan in motion.

Dual-use goods (DUG) use case

Another interesting example of applying NLP to Dow Jones risk and compliance data is the dual-use goods content set. Dual-use goods are products and technologies that are common in regular, everyday use but that could also have military applications. Examples of dual-use goods are some models of drones, aluminum pipes with precise specifications, or certain kinds of ball bearings. Because of the potential risk that these goods could be used for military purposes, regulators try to control, or at least monitor, when and where these goods are sold. Every actor in the supply chain, including the banks financing the goods, is now required to conduct additional checks covering, for example, export control licensing, counterparties to the trade, means of transport and locations.

Tracking dual-use goods may also seem a trivial task in theory, but in practice it is both arduous and resource-intensive. Financial institutions that track dual-use goods using broad terms can end up with hundreds of false positives and incur additional costs; they need a combination of industry expertise and accurate, actionable data to enable transaction screening.

There is no universal language for describing the goods being traded. Some goods, or the materials contained within goods, are difficult to identify. Thus, the idea behind our research was to train a machine to develop a sense of context for the words in our vocabulary (i.e. all of the unique words in the data used to train the language model) and to look for the connections humans have missed. Building on these discovered connections, the model is able to predict possible dual-use goods descriptions that do not match the formal terms regulators use to describe dual-use goods.
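To make the notion of a "sense of context" concrete, here is a toy sketch. The article does not name the language model used, so this example swaps in plain co-occurrence counts as a stand-in for learned word embeddings; the corpus, the window size, and the function names are all assumptions made for illustration. Words that appear in similar contexts end up with similar vectors, which is how an informal term can be linked to a formal one.

```python
import math
from collections import Counter, defaultdict

# Invented toy corpus of goods descriptions (not real trade data).
CORPUS = [
    "aluminum tube high strength for centrifuge rotor",
    "aluminum pipe high strength for centrifuge rotor",
    "plastic pipe for garden irrigation",
    "plastic hose for garden irrigation",
]


def context_vectors(sentences, window=2):
    """Count, for each word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


vectors = context_vectors(CORPUS)
sim_tube_pipe = cosine(vectors["tube"], vectors["pipe"])
sim_tube_hose = cosine(vectors["tube"], vectors["hose"])
print(f"tube~pipe: {sim_tube_pipe:.2f}, tube~hose: {sim_tube_hose:.2f}")
```

In this toy data, "tube" scores closer to "pipe" than to "hose" because it shares the "aluminum", "high" and "strength" contexts. A trained embedding model would capture the same kind of signal from far more data and with far more nuance.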

Originally posted on OpenDataScience.com