Managing Data Feed Integrations for AI Systems
You must evaluate whether the identified external data can support the intended outcome. The practical way to do that is to test the data’s validity early, using partial datasets and simplified logic.
The objective is to assess correlation rather than determine absolute data accuracy. In other words, answer the question: ‘Do changes in the data move scores in ways our client’s domain experts expect?’ If relevant correlation signals are absent at this early stage of a pilot, scaling will not fix it!
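That expert check can be supported with a simple quantitative sanity test. The sketch below is illustrative only, assuming pandas and hypothetical column names (an emissions metric and an environmental score, one row per period); it asks whether period-over-period changes in an input move the score in the expected direction:

```python
import pandas as pd

# Hypothetical pilot data: one row per period, with a raw input metric
# and the score the system produced for that period.
df = pd.DataFrame({
    "emissions_intensity": [410, 395, 388, 360, 352, 349],
    "env_score":           [41.0, 43.5, 44.1, 48.0, 49.2, 49.5],
})

# Period-over-period changes: does the score move when the input moves?
deltas = df.diff().dropna()

# Rank correlation is robust to scale and outliers; the sign and rough
# magnitude are what domain experts should confirm, not the exact value.
signal = deltas["emissions_intensity"].corr(deltas["env_score"], method="spearman")
print(f"Spearman correlation of deltas: {signal:+.2f}")
```

A strongly negative value here would match the expert expectation that falling emissions intensity should lift the environmental score.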
It’s also important at this point to resist the natural tendency to over-collect data. Each data feed must justify its inclusion against three key criteria, and a simple test applies:
If removing a feed does not materially change outputs or decisions, then it does not belong in the core data feed pipeline.
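That test can be automated during a pilot as a feed ablation check. The following is a minimal sketch, assuming a hypothetical `score_fn` that maps a dict of feed datasets to a vector of scores; the function name, inputs and tolerance are illustrative, not a real API:

```python
import numpy as np

def feed_matters(score_fn, feeds: dict, feed_name: str, tolerance: float = 0.5) -> bool:
    """Return True if dropping `feed_name` materially moves the scores."""
    baseline = np.asarray(score_fn(feeds))
    reduced = {k: v for k, v in feeds.items() if k != feed_name}
    ablated = np.asarray(score_fn(reduced))
    # Mean absolute score shift, in score points, caused by removing the feed.
    shift = np.mean(np.abs(baseline - ablated))
    return shift > tolerance
```

Feeds that fail this check are candidates for removal from the core pipeline.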
You must hardwire the types of data feeds you select to the business objective and to the solution you are trying to create. For example, GaiaLens built its own ESG scoring and anti-greenwashing AI-based solution for asset managers and financial institutions.
This system draws on a mixture of third-party Application Programming Interfaces (APIs), structured batch datasets, regulatory disclosures, event-driven updates (including annual reports), and ‘derived’ datasets created by our data team through contextual enrichment and data normalisation.
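The mix matters because each source type carries different refresh cadences and governance expectations. Purely as an illustration (the names, fields and entries below are hypothetical, not GaiaLens’s actual configuration), a feed registry might tie each feed to its source type and the business objective it serves:

```python
from dataclasses import dataclass
from enum import Enum

class SourceType(Enum):
    THIRD_PARTY_API = "api"
    BATCH_DATASET = "batch"
    REGULATORY_DISCLOSURE = "disclosure"
    EVENT_DRIVEN = "event"
    DERIVED = "derived"

@dataclass(frozen=True)
class FeedSpec:
    name: str
    source_type: SourceType
    business_objective: str   # the objective this feed is hardwired to
    refresh: str              # e.g. "real-time", "daily", "annual"

REGISTRY = [
    FeedSpec("annual_reports", SourceType.EVENT_DRIVEN,
             "anti-greenwashing evidence", "annual"),
    FeedSpec("normalised_esg_metrics", SourceType.DERIVED,
             "peer-comparable ESG scoring", "daily"),
]
```

A registry like this makes it easy to see, and to challenge, why each feed exists.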
Data normalisation is the disciplined process that makes different data points comparable, stable and safe to combine, so that like can be compared with like: scale cannot distort importance, noise must not overwhelm signal, and outputs remain interpretable, defensible and explainable.
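As a concrete illustration, here is a minimal normalisation sketch, assuming pandas and illustrative column names: metrics are winsorised to tame noise, then z-scored within a peer group so that scale cannot distort importance when metrics are combined:

```python
import pandas as pd

def normalise(df: pd.DataFrame, metric: str, group: str = "sector") -> pd.Series:
    # Winsorise: cap extreme values at the 1st and 99th percentiles
    # so outlier noise cannot overwhelm the signal.
    clipped = df[metric].clip(df[metric].quantile(0.01), df[metric].quantile(0.99))
    # Z-score within the peer group so values from differently scaled
    # metrics and sectors become directly comparable.
    grouped = clipped.groupby(df[group])
    return (clipped - grouped.transform("mean")) / grouped.transform("std")
```

The exact recipe will vary by metric; the discipline is that every transformation is explicit and reversible in audit terms.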
Assessing the reliability, latency and volatility of external data sources is also vital work for our data engineers. We measure reliability in terms of historical uptime and schema stability; latency by delivery consistency; and volatility by how often values change unexpectedly.
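These measures are straightforward to compute from a delivery log. The sketch below assumes a pandas DataFrame with one row per expected delivery and hypothetical columns (`arrived`, `schema_hash`, `delay_s`, `value`); the thresholds are assumptions, not fixed standards:

```python
import pandas as pd

def feed_health(log: pd.DataFrame) -> dict:
    return {
        # Reliability: share of expected deliveries that actually arrived.
        "uptime": log["arrived"].mean(),
        # Schema stability: how many distinct schema fingerprints were seen.
        "schema_versions": log["schema_hash"].nunique(),
        # Latency consistency: 95th-percentile delivery delay, in seconds.
        "latency_p95_s": log["delay_s"].quantile(0.95),
        # Volatility: share of values that moved more than expected (>20%).
        "unexpected_change_rate": (log["value"].pct_change().abs() > 0.2).mean(),
    }
```

Tracked over time, these numbers tell you which feeds to trust and which to quarantine.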
Real-time pipelines prioritise resilience and allow graceful degradation. When a real-time data feed becomes unavailable, when latency increases beyond the pre-agreed tolerance, when data quality drops below acceptable thresholds, or when schema or semantic changes are detected, the system must deliberately reduce its capability. It must be designed to preserve correctness over completeness and to avoid corrupting ground truth or scores.
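One way to express that behaviour is an explicit operating mode chosen from the health signals above. The sketch below is illustrative, not GaiaLens’s implementation; the mode names and thresholds are assumptions:

```python
from enum import Enum

class Mode(Enum):
    FULL = "full"          # all feeds healthy, publish normally
    DEGRADED = "degraded"  # serve last known-good scores, flag staleness
    HALTED = "halted"      # stop publishing rather than corrupt ground truth

def choose_mode(feed_up: bool, latency_s: float, quality: float,
                schema_changed: bool,
                max_latency_s: float = 300.0, min_quality: float = 0.95) -> Mode:
    # Correctness over completeness: any schema or semantic change
    # halts output entirely until a human has reviewed the feed.
    if schema_changed:
        return Mode.HALTED
    # Unavailability, excess latency or low quality degrade gracefully.
    if not feed_up or latency_s > max_latency_s or quality < min_quality:
        return Mode.DEGRADED
    return Mode.FULL
```

Making the mode explicit means degradation is a designed state, never an accident.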
Near-real-time pipelines focus on checkpointing and replay. When a processing failure occurs midway, the system must establish where the pipeline crashed, identify the last successful checkpoint, and determine the offset of the last record committed at that checkpoint.
On restart, the system restores the state from that checkpoint and re-reads data from that point. It then recomputes outputs. This ensures no data is lost, avoids double-counting, and enables a deterministic recovery.
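A minimal file-based sketch illustrates the pattern; a production system would checkpoint to durable storage and track offsets per partition, and the processing function must be deterministic and idempotent for recovery to avoid double-counting:

```python
import json, os

CHECKPOINT = "pipeline.checkpoint.json"

def save_checkpoint(offset: int) -> None:
    # Write atomically so a crash mid-write cannot corrupt the checkpoint.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"last_committed_offset": offset}, f)
    os.replace(tmp, CHECKPOINT)

def restore_offset() -> int:
    if not os.path.exists(CHECKPOINT):
        return 0
    with open(CHECKPOINT) as f:
        return json.load(f)["last_committed_offset"]

def run(records, process):
    # On restart, skip everything already committed and replay from there.
    offset = restore_offset()
    for i, record in enumerate(records):
        if i < offset:
            continue
        process(record)
        save_checkpoint(i + 1)
```

The atomic write plus the committed offset are what make recovery deterministic.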
Batch pipelines, by contrast, emphasise validation and reconciliation. Their primary design objective is correctness rather than immediacy. Batch processing is typically used where data feeds define records of truth or support financial, regulatory or reporting outcomes. These feeds must be complete and internally consistent, and it must be possible to prove this data quality.
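Reconciliation can be as simple as proving that record counts and control totals agree between source and load. A minimal sketch, with assumed inputs:

```python
def reconcile(source_count: int, loaded_count: int,
              source_total: float, loaded_total: float) -> None:
    # Prove completeness: every source record is accounted for.
    if loaded_count != source_count:
        raise ValueError(
            f"row count mismatch: {loaded_count} loaded vs {source_count} at source")
    # Prove internal consistency: control totals agree to a tight tolerance.
    if abs(loaded_total - source_total) > 1e-6:
        raise ValueError("control total mismatch between source and load")
```

Keeping these checks as hard failures, not warnings, is what makes data quality provable.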
During pilots, GaiaLens applies checks for data feed completeness, freshness and logical consistency. Missing or stale data is explicitly flagged. Anomalies are isolated and investigated offline before they influence scores.
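Flagging rather than substituting is the key discipline. A minimal sketch of explicit quality flags, assuming each record carries a value and a timezone-aware `as_of` timestamp (the field names and 30-day staleness window are illustrative):

```python
from datetime import datetime, timedelta, timezone

def quality_flags(record: dict, max_age: timedelta = timedelta(days=30)) -> list[str]:
    # Flag problems explicitly; never silently substitute a default value.
    flags = []
    if record.get("value") is None:
        flags.append("MISSING")
    as_of = record.get("as_of")  # assumed timezone-aware, or absent
    if as_of is None or datetime.now(timezone.utc) - as_of > max_age:
        flags.append("STALE")
    return flags
```

Records carrying flags are routed to offline investigation rather than into the scoring path.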
Silent substitution must be avoided: it quickly compromises AI systems and is difficult to detect and recover from, partly because it tends to produce outputs that look plausible while quietly corrupting scoring and ground truth until it is too late. In short, silent substitution can invalidate months of results.
Define and agree on ‘ground truth’ with business and domain experts. This is a governance and design exercise first, and a technical exercise second. The objective is not to find a philosophically ‘perfect’ truth, but to establish a shared, testable, auditable reference reality that the organisation agrees to treat as correct for a specific decision, at a specific point in time.
Ground truth is meaningless unless it is anchored to a business decision or outcome. As part of establishing it, agree with the business which decision each ground truth supports and at which point in time it is treated as correct.
Ground truths must all be reviewed regularly. Assess whether your ground truth(s) are explainable to a regulator, auditor or standards body. If they sound too vague, this is a good indication they are not precise enough.
In trust-based domains like ESG scoring, you need to be able to drill down into the factors which contribute to the scoring. Black-box systems which spit out uncheckable scores are not good enough for most AI applications. Your AI system must be fully explainable and transparent, especially when it underpins regulatory reporting.
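At its simplest, explainability means the headline score is a decomposable function of named factor contributions. A minimal sketch, with illustrative factors and weights rather than any real scoring methodology:

```python
def explain_score(factors: dict[str, float], weights: dict[str, float]) -> dict:
    # Every score is the weighted sum of named contributions, so any
    # headline number can be decomposed and checked factor by factor.
    contributions = {k: factors[k] * weights[k] for k in factors}
    return {"score": sum(contributions.values()), "contributions": contributions}

result = explain_score(
    factors={"emissions": 62.0, "board_diversity": 71.0, "disclosure": 55.0},
    weights={"emissions": 0.5, "board_diversity": 0.3, "disclosure": 0.2},
)
print(result["score"], result["contributions"])
```

Exposing the contributions alongside the score is what turns a black box into an auditable system.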
It is also important to be able to measure the confidence, uncertainty level or margin of error in scores. Confidence ranges based on data coverage are only credible if they are grounded in measurable coverage indicators. GaiaLens draws on up to 10 typical data coverage dimensions, which are built into our transparency scores to measure confidence level.
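To make the idea concrete, the sketch below shows how a coverage-grounded confidence band might work. The dimensions and the scaling are hypothetical; they are not GaiaLens’s actual ten dimensions or transparency-score formula:

```python
def confidence_band(coverage: dict[str, float], score: float) -> tuple[float, float]:
    # Coverage values are in [0, 1]; overall coverage is their mean.
    overall = sum(coverage.values()) / len(coverage)
    # The band narrows as measurable coverage improves:
    # up to +/- 20 score points at zero coverage (assumed scaling).
    half_width = (1.0 - overall) * 20.0
    return (score - half_width, score + half_width)

band = confidence_band(
    {"reported_metrics": 0.9, "recency": 0.7, "third_party_corroboration": 0.5},
    score=64.0,
)
print(f"score 64.0, range {band[0]:.1f} to {band[1]:.1f}")
```

The essential property is that the band is derived from measured coverage, not asserted.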
Handing back AI systems which we have designed and developed for clients is sensitive work. In-house teams need multiple skills to ensure the health and effectiveness of AI systems going forward.
They need mature data engineering, domain expertise, governance and operational monitoring capabilities, not just data science skills. In terms of governance structures, clear ownership of data sources, scoring logic and change controls needs to be supported by cross-functional oversight.
Finally, it is important to consider and mitigate the data feed integration risks that can derail an AI pilot in the vital first 90 days of a project. The biggest, in my view, are that the team overestimates the data quality the system is generating from day one while underestimating schema volatility.
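Schema volatility, at least, is cheap to detect early. A minimal fingerprinting sketch, with an illustrative expected record shape:

```python
import hashlib
import json

def schema_fingerprint(record: dict) -> str:
    # Fingerprint field names and types, not values, so the check is cheap.
    shape = sorted((k, type(v).__name__) for k, v in record.items())
    return hashlib.sha256(json.dumps(shape).encode()).hexdigest()[:12]

# Illustrative expected shape; real feeds would register theirs at onboarding.
EXPECTED = schema_fingerprint(
    {"isin": "XS0000000000", "co2_tonnes": 0.0, "as_of": "2024-01-01"})

def check(record: dict) -> None:
    if schema_fingerprint(record) != EXPECTED:
        raise ValueError("schema drift detected; quarantine feed for review")
```

Running a check like this on every delivery surfaces schema drift on day one instead of month three.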
Be ready to monitor and tweak systems to keep them on track. To this end, phased delivery reduces risk, builds trust and allows learning and hardening of systems before scaling.