The challenges posed by officially published open data

August 28, 2019

In this article, I outline some challenges posed by officially published open data and explore the pragmatic approach of London-based technology company Doorda.

I believe a tipping point has been reached and it is now essential for private and public organisations to capitalise on the tsunami of open data to provide improved products and services. Furthermore, organisations need to access “the whole wave”, not just a part of it. Ultimately, they need to mix it all in with their existing “lake” of data.

Open data definition

Data is considered “open” if anyone is free to use, reuse and redistribute it – subject only to the requirement to attribute and/or share-alike. Most national governments are pushing the publication of open data, implemented on their behalf by official institutions, nationally and regionally.

Open data is usually information and statistics about the population, the areas where they live, the companies they work for and the things that affect their lives – transport, crime, health, trade, public sector expenditure, education, vehicles, weather, etc. The list grows daily.

Although open data is non-personal, and so avoids the privacy concerns addressed by regulations such as GDPR, it does present significant challenges.

Open data challenges

Unfortunately, organisations rarely make use of this wealth of data because it is published in awkward, inconsistent ways and often cannot easily be linked together.

Data is provided in various places and formats (national websites, regional databases, unstructured documents, semi-structured messages and structured records) and is accessible in various ways (on a webpage, from file transfer or programmatically).

Sometimes, data is published automatically to a location on the internet; sometimes it is made available on request; sometimes it is delivered by email. The frequency of publication varies according to the source of the data and can be any combination of decennial, annual, quarterly, monthly and daily. Publication can also be irregular, depending on human factors or events.

Since there are thousands of different organisations involved in the data collection and most of it is manually entered, the quality and consistency of this data are poor. Acme Widgets Limited in one town could be Acme Widgets (UK) in another. This is an instant data matching problem for anyone trying to automatically identify the property footprint of that company, along with any associated liabilities, licences or sanctions. Furthermore, this inability to reliably and consistently identify a property, area or company by an official unique identifier is problematic when trying to associate or join open data to existing internal data.
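To make the matching problem concrete, here is a minimal sketch of name normalisation in Python. It is an illustration only, not Doorda's method: the function name and the short list of ignorable tokens are assumptions, and a production matcher would need a far longer list plus an official identifier to confirm each match.

```python
import re

# Hypothetical set of legal-form and country tokens to ignore when comparing
# names. A real list would be much longer; this one is an assumption.
IGNORED_TOKENS = {"limited", "ltd", "plc", "llp", "uk"}


def normalise_company_name(name: str) -> str:
    """Return a simplified key for comparing company names."""
    # Lower-case, then turn every run of non-alphanumeric characters into a space.
    cleaned = re.sub(r"[^a-z0-9]+", " ", name.lower())
    # Drop tokens that tend to vary between publishers.
    tokens = [t for t in cleaned.split() if t not in IGNORED_TOKENS]
    return " ".join(tokens)


key_a = normalise_company_name("Acme Widgets Limited")
key_b = normalise_company_name("Acme Widgets (UK)")
print(key_a == key_b)  # True: both reduce to "acme widgets"
```

A shared key like this only flags a candidate match. Two unrelated companies can reduce to the same key, which is why an official unique identifier remains the safer basis for joining.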

Although open data almost always provides a latest or current view, many publishers do not provide a historical view. This makes it impossible, for example, to see data as it was on the day a decision was made or to see how it has changed over time. Implementing automated processes to provide history is complex and costly.

The publishing of open data by public sector organisations is not their primary focus, so data publication solutions are often fragile. Despite the efforts of the teams involved, “data outages” are common.

Exploiting open data usually requires multiple tools and strong technical expertise, which those wishing to access and use the data rarely have to hand.

Data uses

Yet there are rewards for those able to overcome the challenges.

Data scientists are keen to add new data to their predictive models for risk. Customer-facing organisations want frictionless web interfaces, automatically filling in the correct information to reduce keystrokes and accelerate sign-ups. For example, some websites use vehicle registration numbers to automatically and reliably fill out make and model. Many organisations want to analyse trends in public sector expenditure. Marketeers are enriching in-house information with fresh data, improving their prospecting success rates.

Further examples:

Customer acquisition

Sophisticated customer segmentation to better identify target markets, improve marketing response models, refine risk assessments and provide speedy, frictionless on-boarding.

Customer management

More complete customer models maximise potential revenue, assess risk events and optimise debt recoveries.

Business planning

Broader, more complete data improves location planning, supplier assessment and competitive intelligence.

Commercial finance

Proactively find potential funding needs and improve marketing responses and customer risk analysis.

Commercial property

Improve location planning, win-rates for ratings appeals and investment analysis.

M&A/Capital markets

Use a summary of trading premises, property assets, controlling parties, public sector contracts and receipts to inform conclusions and decisions.

The best approach

I have worked with open data for many years now, researching, gathering, consolidating and linking thousands of open datasets from official sources such as HMRC, Ordnance Survey, the Land Registry, local authorities and Companies House. I believe that capitalising on the value of this tsunami of data has three main themes:

Business-ready data

The data must be made “business-ready”, allowing the experts to focus immediately on analysis and insight while avoiding the repeated delays, costs and risks of finding and preparing data. In fact, by harmonising data from multiple sources, it is possible to resolve many inconsistencies and errors, providing better data quality than from any single source.

There must be processes to identify and store changes in source data, building an up-to-date historical trail. Even if there are derived “scores”, the untampered, detailed data must remain available, allowing analysts and data scientists to create their own unique insight and competitive advantage.
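One simple way to build such a trail is to compare each newly published snapshot with the previous one and record what changed. The Python sketch below assumes each snapshot is a dictionary keyed by a record identifier. The function name, field names and sample identifiers are hypothetical, and the article does not say how any particular platform implements this.

```python
from datetime import date


def diff_snapshots(previous: dict, current: dict, seen_on: date) -> list:
    """Compare two snapshots keyed by record id and return change events."""
    events = []
    # Visit every id that appears in either snapshot, in a stable order.
    for key in sorted(previous.keys() | current.keys()):
        old, new = previous.get(key), current.get(key)
        if old is None:
            change = "added"
        elif new is None:
            change = "removed"
        elif old != new:
            change = "updated"
        else:
            continue  # unchanged records produce no event
        events.append(
            {"id": key, "change": change, "old": old, "new": new, "seen_on": seen_on}
        )
    return events


# Hypothetical snapshots of the same source on two publication dates.
previous = {
    "C001": {"name": "Acme Widgets Limited", "status": "active"},
    "C002": {"name": "Example Trading Ltd", "status": "active"},
}
current = {
    "C001": {"name": "Acme Widgets Limited", "status": "dissolved"},
    "C003": {"name": "New Venture Ltd", "status": "active"},
}

history = diff_snapshots(previous, current, date(2019, 8, 28))
for event in history:
    print(event["id"], event["change"])
```

Because each event keeps both the old and new values, the detailed source data stays available alongside any derived scores, and the state of a record on a given day can be reconstructed by replaying events up to that date.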

Joined-up data

Where it makes sense, it must be possible to consistently and reliably join the open data from all these sources, to join the results to other third-party data and to join with internal data held by the organisation. During harmonisation, certain key data elements, such as postal address, postcode and company name, should be identified, cleansed and standardised to allow joining.
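As a small illustration of standardising a key before joining, the Python sketch below normalises UK postcodes and then links open records to internal records on the cleaned value. It relies on the fact that the inward part of a UK postcode is always its last three characters. It does not validate that a postcode actually exists, and the record fields and sample values are hypothetical.

```python
def normalise_postcode(raw: str) -> str:
    """Upper-case a UK postcode and put one space before the last three characters."""
    compact = "".join(raw.split()).upper()
    if len(compact) < 5:
        # Too short to split sensibly; return the compact form unchanged.
        return compact
    return f"{compact[:-3]} {compact[-3:]}"


def join_on_postcode(open_records: list, internal_records: list) -> list:
    """Pair each internal record with the open records sharing its postcode."""
    index = {}
    for record in open_records:
        index.setdefault(normalise_postcode(record["postcode"]), []).append(record)
    joined = []
    for record in internal_records:
        for match in index.get(normalise_postcode(record["postcode"]), []):
            joined.append({**match, **record})
    return joined


# Hypothetical records: the same postcode written three different ways.
open_records = [
    {"postcode": "sw1a1aa", "flood_risk": "low"},
    {"postcode": "M1  1AE", "flood_risk": "medium"},
]
internal_records = [{"postcode": " SW1A 1AA ", "customer": "Acme Widgets Limited"}]

print(join_on_postcode(open_records, internal_records))
```

Note that the joined row keeps the internal record's original postcode text, since its fields are merged last. Storing the normalised value in a separate field would be a reasonable refinement.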

Unlike legacy data matching solutions, this new data matching service must be automated, require no human intervention and avoid “false positives”. The data matching service must remain available to cleanse and match other data sources when required.

Self-service cloud platform

The data should be available on a “highly available” self-service cloud platform, with automated feeds keeping the data fresh and building a historical audit trail. Access to the data should remain simple, though, whether in bulk (extracted or queried) or by individual transaction via programmatic interfaces (SQL and API).
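The difference between bulk and transactional access can be sketched with Python's built-in sqlite3 module standing in for a cloud SQL interface. The table name, columns and sample rows are hypothetical. The article does not describe the schema or interface of any real platform, so this only shows the two access patterns.

```python
import sqlite3

# An in-memory database stands in for the hosted platform (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companies (company_id TEXT PRIMARY KEY, name TEXT, postcode TEXT)"
)
conn.executemany(
    "INSERT INTO companies VALUES (?, ?, ?)",
    [
        ("C001", "Acme Widgets Limited", "SW1A 1AA"),
        ("C002", "Example Trading Ltd", "SW1A 2AB"),
        ("C003", "New Venture Ltd", "M1 1AE"),
    ],
)


def bulk_by_postcode_prefix(conn: sqlite3.Connection, prefix: str) -> list:
    """Bulk access: return every company whose postcode starts with the prefix."""
    query = (
        "SELECT company_id, name FROM companies "
        "WHERE postcode LIKE ? ORDER BY company_id"
    )
    return conn.execute(query, (prefix + "%",)).fetchall()


def lookup_company(conn: sqlite3.Connection, company_id: str):
    """Transactional access: return one company by its identifier, or None."""
    query = "SELECT company_id, name, postcode FROM companies WHERE company_id = ?"
    return conn.execute(query, (company_id,)).fetchone()


print(bulk_by_postcode_prefix(conn, "SW1A"))
print(lookup_company(conn, "C003"))
```

The bulk query suits analysts extracting a whole segment for modelling, while the single lookup suits a sign-up form that needs one record at a time.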

Conclusion

Open data avoids privacy concerns such as those raised by GDPR, and organisations that overcome its challenges can capitalise on its value. However, that value is not found in isolated files; it lies in consolidating all the relevant data into a single platform and providing easy, “joined-up” access.

Please note: This is a commercial profile

Clifford McDowell

CEO

Doorda Ltd.

Tel: +44 (0)800 133 7108

clifford@doorda.com