The synthetic data approach: The new unlimited data plan for AI models

June 10, 2022

Dayna Arnold, Project Manager at Zest Consult, discusses the benefits of a synthetic data approach to machine learning as an innovative way to increase the availability, accuracy and security of data while lowering its cost

Innovation is critical to business growth today, and the development of artificial intelligence (AI) is a key driver. When successfully deployed, AI improves how a business engages with customers, manages day-to-day operations, and gains a competitive edge. Machine learning, an application of AI, uses algorithms and vast amounts of data to imitate how humans learn. It is used in a wide range of applications (e.g., object recognition, text translation, and smart devices). Successfully deploying AI models is not without challenges, which, if handled inadequately, could prevent a business from moving forward with its AI deployment goals.

The data challenge

The machine learning approach to object recognition requires vast amounts of labelled data to train machines successfully. Obtaining such large real-world data sets can prove to be a major roadblock for an organisation’s AI deployment plans, as the labelled data required may be unavailable, or excessively time-consuming and prohibitively expensive to obtain.

For effective machine learning, real data needs to be labelled (identifying objects in the data and tagging them with labels that will help the machine learning model learn from the data). Good quality labelling enables the machine to make accurate predictions and estimations. Paying labellers for their time and effort significantly increases the cost of developing AI. Therefore, not only can it be difficult to get large quantities of real data, but manually labelling datasets is labour-intensive and excessively expensive.

Accuracy and security must also be considered. Real-world data sets may exclude certain populations or contain errors, leading to models that represent real-world scenarios inaccurately, fail during testing, or produce inaccurate or biased results. Finally, those who have access to real-world data are responsible for ensuring that sensitive data (e.g., personal identification data) is used ethically and in compliance with the General Data Protection Regulation (GDPR). Sensitive data must be protected from unauthorised access or unethical use. Securing data can be expensive and restrictive, which limits the amount of data available.

Synthetic data approach in AI technology

Using the latest research in AI and machine learning for vision, a collaborative team from Zest Consult, Costain, and the University of West England developed and applied a synthetic data approach to design a leading-edge AI and machine learning system for the remote monitoring of construction sites. The system uses AI technology that enables the machine to perform visual recognition. The main challenge limiting automated site monitoring was the lack of labelled data sets to train object recognition machine learning models. Because the initial data set generated with the AI algorithm is synthetic, it is labelled automatically, so no human effort was required for annotation or quality validation, saving the team both time and money.

Improved accuracy is another benefit of applying a synthetic data approach to machine learning. Synthetic data is created using AI algorithms that can generate data with an appropriate balance, distribution, and other parameters, whereas data drawn from real-world events offers less control over such parameters. This reduces bias in datasets, which ultimately increases the accuracy of the predictive modelling used in machine learning.
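
To illustrate the kind of control over balance described above, here is a minimal Python sketch. It is not the team's method: the function name, the object classes, the lighting conditions and the camera-angle parameter are all hypothetical, chosen only to show how a generator can be asked for an equal number of scenes per combination of conditions.

```python
import itertools
import random
from collections import Counter


def balanced_scene_specs(classes, lightings, per_combination, seed=0):
    """Return scene descriptions with equal counts for every
    (object class, lighting) combination.

    All parameter names are hypothetical; a real generator would feed
    these specs to a renderer.
    """
    rng = random.Random(seed)
    specs = []
    for cls, light in itertools.product(classes, lightings):
        for _ in range(per_combination):
            specs.append({
                "object": cls,
                "lighting": light,
                # Nuisance parameter drawn uniformly so no angle dominates.
                "camera_angle_deg": rng.uniform(0, 360),
            })
    rng.shuffle(specs)  # avoid ordering effects during training
    return specs


# Hypothetical classes and conditions, for illustration only.
specs = balanced_scene_specs(["worker", "excavator"], ["day", "dusk", "night"], 4)
counts = Counter((s["object"], s["lighting"]) for s in specs)
```

With real-world footage, some combinations (for example, an excavator at night) might be rare or absent; in a generated set, every combination appears as often as requested.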

The synthetic data approach also addresses the challenge of protecting sensitive data. The data generated is new data that has the same characteristics as the original real data and therefore produces the same results. Rather than merely altering the original data, the synthetic data replaces it, mimicking only its statistical properties while maintaining the same predictive ability. As a result, it is nearly impossible to recreate the original data, which keeps the original data secure. This removes the privacy and security obstacles that would otherwise make data more difficult, costly, and time-consuming to obtain.
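
As a deliberately simple illustration of "mimicking only the statistical properties", the Python sketch below fits a mean and standard deviation to a small set of values and then draws entirely new values from a normal distribution with those parameters. The input numbers are hypothetical and the Gaussian model is an assumption made for brevity. Production synthetic data tools use far richer models, and a sketch like this does not by itself demonstrate any privacy guarantee.

```python
import random
import statistics


def fit_gaussian(values):
    """Summarise a numeric column by its sample mean and standard deviation."""
    return statistics.mean(values), statistics.stdev(values)


def sample_synthetic(mean, stdev, n, seed=0):
    """Draw n new values from a normal distribution with the fitted parameters.

    Only the two summary statistics are used; the original records are
    never copied into the output.
    """
    rng = random.Random(seed)
    return [rng.gauss(mean, stdev) for _ in range(n)]


# Hypothetical "real" measurements, for illustration only.
real = [52.0, 48.5, 50.2, 49.1, 51.7, 47.9, 50.6]
mean, stdev = fit_gaussian(real)
synthetic = sample_synthetic(mean, stdev, 2000)

print(f"real:      mean={mean:.2f}, stdev={stdev:.2f}")
print(f"synthetic: mean={statistics.mean(synthetic):.2f}, "
      f"stdev={statistics.stdev(synthetic):.2f}")
```

The synthetic column follows the same overall distribution as the original, so a model trained on it sees similar statistics, while none of its values is taken from the original records.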

The synthetic data approach developed and used by the Zest Consult collaborative team was an augmentation approach, in which a synthetic data set was generated using various computer algorithms. These algorithms were applied to a combination of a small quantity of real-world images (construction site and Google images) and a 3D model of a common guardrail. This approach eliminated the need to collect a significant amount of real data, which is especially beneficial where data collection would not be feasible due to time or location limitations.
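
The general idea of augmentation with automatic labelling can be sketched in a few lines of Python. This is an assumption-laden toy, not the team's patented pipeline: images are plain nested lists of pixel values, the "guardrail" is a small rectangular patch standing in for a rendered 3D model, and the function names are hypothetical. The point it shows is that when the generator chooses where to place the object, the bounding-box label is known exactly and needs no human annotator.

```python
import random


def composite(background, patch, rng, label="guardrail"):
    """Paste `patch` onto a copy of `background` at a random position.

    Both images are lists of rows of pixel values. Returns the new image
    and a bounding-box annotation derived from the chosen placement.
    """
    bg_h, bg_w = len(background), len(background[0])
    p_h, p_w = len(patch), len(patch[0])
    top = rng.randint(0, bg_h - p_h)    # randint is inclusive at both ends
    left = rng.randint(0, bg_w - p_w)
    image = [row[:] for row in background]  # leave the original untouched
    for r in range(p_h):
        for c in range(p_w):
            image[top + r][left + c] = patch[r][c]
    annotation = {"label": label, "x": left, "y": top, "w": p_w, "h": p_h}
    return image, annotation


def make_dataset(backgrounds, patch, n, seed=0):
    """Generate n labelled synthetic images from a few real backgrounds."""
    rng = random.Random(seed)
    return [composite(rng.choice(backgrounds), patch, rng) for _ in range(n)]


# Two tiny hypothetical "site photos" (8 rows x 10 columns) and a 2 x 3 object.
backgrounds = [[[0] * 10 for _ in range(8)], [[1] * 10 for _ in range(8)]]
patch = [[9, 9, 9], [9, 9, 9]]
dataset = make_dataset(backgrounds, patch, n=5)
```

A handful of backgrounds and one object model can yield as many labelled images as needed; a realistic pipeline would also vary scale, rotation, lighting and occlusion.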

Once the initial synthetic data set had been established, the time and effort involved in generating more synthetic data decreased considerably, driving the cost and time needed to develop the AI model even lower than if the team had relied only on real-world data. From this initial synthetic data set, vast amounts of synthetic data resembling real-world scenarios were generated using 3D virtual scenes and 3D models of workers, components, and plant equipment. This data was used to train the machine to recognise, track, and record the presence and proximity of construction equipment and workers from various camera feeds.

Results and improvement in the synthetic data approach

The Zest Consult collaborative team’s synthetic data approach proved highly effective when applied to machine learning (Patent No: 2204397.0 pending). Their object recognition AI models performed with a high level of accuracy, as evidenced by a reduction of ~80% in false positives in the field.
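
For readers unfamiliar with how a figure such as "~80% reduction in false positives" is calculated, the short Python sketch below shows the arithmetic. The counts are hypothetical and are not the project's data; the article does not state the baseline or the raw numbers.

```python
def relative_reduction(before, after):
    """Fraction by which a count fell, relative to its starting value."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Hypothetical counts, for illustration only (not from the project).
baseline_false_positives = 50
improved_false_positives = 10

reduction = relative_reduction(baseline_false_positives, improved_false_positives)
print(f"{reduction:.0%} fewer false positives")
```

In this made-up case, 40 of the 50 original false alarms disappear, which is an 80% reduction.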

This synthetic data approach to AI/machine learning provided a qualitative improvement over traditional manual site monitoring services and enabled the team to achieve their goal of delivering a system that performs robust object recognition, tracking, and photogrammetry to boost environmental, safety and productivity capabilities.

Overall, the team found that when they were not bound by the restrictions of collecting and using solely real-world data, they were able to generate the vast amounts of synthetic data required for machine learning in significantly less time and at a much lower cost, which resulted in a more accurate AI model.

Please note: This is a commercial profile