Accelerating AI Training Data Delivery with Real-Time Analytics

Appen supplies the training data that enterprise AI teams depend on to build production models. The company manages annotation workflows across text, image, audio, and video for clients training at scale.

`01` The Challenge

Thirty-minute report cycles in a market that expects real-time visibility

The AI data services market is projected to reach $28.5B by 2034, growing at 28.6% CAGR. In that market, processing and delivery speed is the differentiator. Annotation teams need quality metrics, throughput rates, and job status in near real time. Without that visibility, errors accumulate before anyone notices. Business units need pipeline data to commit to client SLAs. When reporting lags, decisions lag with it.

Appen’s existing analytics platform generated reports in 30-minute cycles. The architecture was built in the company’s early days. It handled moderate volumes. It was not designed for 1,000 parallel workflows or real-time dashboards serving annotation teams across time zones.

Data teams waited half an hour to see whether a job was running correctly. By the time a quality issue surfaced, thousands of items might already have been annotated against a flawed configuration.

The business case was specific:

Support up to 1,000 parallel data collection, processing, and reporting workflows
Move report generation from 30 minutes to near real time
Build an analytics foundation that could absorb growth in data volumes without degrading

In 2020, Appen integrated Figure Eight (a human-in-the-loop platform for data transformation), adding new data streams and annotation modalities. The volume pressure intensified. Appen partnered with Provectus to redesign the analytics platform from the ground up.

`02` The Approach

Map the bottleneck. Prove linear scalability before production rollout.

Provectus began with discovery workshops to assess Appen’s existing operations, infrastructure, and data flows. The team mapped where latency accumulated and found three sources: batch report generation, monolithic workflow orchestration, and a storage layer not built for concurrent reads.

The assessment confirmed what Appen’s leadership suspected. The legacy architecture could no longer support real-time analytics or parallel processing at the volumes the business now required.

From the assessment, Provectus defined the target. A cloud-native data platform on AWS. Event-driven pipelines. A data lake for flexible querying. Microservices replacing the monolithic workflow engine. The design principle was linear scalability. Adding annotation jobs should not degrade performance for existing ones.

Before full production rollout, Provectus ran load tests proving linear scalability against Appen’s projected growth targets. The architecture had to pass before migration began.
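One way to express a linear-scalability pass criterion is to check that throughput per workflow stays roughly constant as the number of parallel workflows grows. The sketch below is illustrative only: the function name, the 10% tolerance, and the sample numbers are assumptions, not figures or tooling from the Appen load tests.

```python
from typing import List, Tuple

# Each sample is (parallel_workflows, total_items_per_second).
Sample = Tuple[int, float]


def scales_linearly(samples: List[Sample], tolerance: float = 0.10) -> bool:
    """Return True if per-workflow throughput never drops more than
    `tolerance` below the per-workflow throughput of the first sample."""
    base_workflows, base_throughput = samples[0]
    base_per_workflow = base_throughput / base_workflows
    floor = base_per_workflow * (1.0 - tolerance)
    for workflows, throughput in samples:
        if throughput / workflows < floor:
            return False
    return True


# Hypothetical load-test runs (made-up numbers for illustration).
linear_run = [(100, 5000.0), (500, 24500.0), (1000, 48000.0)]
saturating_run = [(100, 5000.0), (500, 18000.0), (1000, 22000.0)]

linear_ok = scales_linearly(linear_run)
saturating_ok = scales_linearly(saturating_run)
```

In the first run, per-workflow throughput moves from 50 to 48 items per second, which stays inside the tolerance. In the second it falls to 22, the signature of a shared bottleneck.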

`03` The Build

Event-streaming pipelines, a cloud-native data lake, and microservices replacing monolithic batch processing

The engineering work covered three layers.

Data pipelines. Provectus built event-streaming pipelines using managed container orchestration and cloud storage on AWS. Appen’s existing data streams and annotation workloads migrated to these pipelines after testing and optimization. Data flows in real time from ingestion through processing to reporting.

Data lake and analytics. A cloud-native data lake replaced the legacy storage and query layer. Annotation teams and business units run on-demand queries for report generation. No more waiting for batch cycles. Dashboards reflect pipeline state as it happens.

Microservices and event backbone. Provectus decomposed the monolithic data workflows into dedicated microservices. Each handles a segment of the annotation pipeline in real time. An event-streaming layer serves as the communication backbone between services and databases. This pattern ensures that adding a new workflow type does not require changes to existing services.
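The decoupling described here can be illustrated with a minimal in-process publish/subscribe sketch. It is a toy model rather than the production design: the `EventBus` class, the topic name `annotation.completed`, the event fields, and the 0.8 quality threshold are all hypothetical, and a real deployment would use a managed streaming service instead of an in-memory dictionary.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Event = dict
Handler = Callable[[Event], None]


class EventBus:
    """Toy stand-in for an event-streaming layer."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> None:
        # Publishers do not know which services consume the event.
        for handler in list(self._handlers.get(topic, ())):
            handler(event)


bus = EventBus()

# Service 1: counts completed items per job (throughput).
completed_by_job: Dict[str, int] = defaultdict(int)


def track_throughput(event: Event) -> None:
    completed_by_job[event["job_id"]] += 1


bus.subscribe("annotation.completed", track_throughput)

# Service 2: added later, without touching service 1 or the publisher.
flagged_items: List[str] = []


def flag_low_quality(event: Event) -> None:
    if event["quality"] < 0.8:  # hypothetical threshold
        flagged_items.append(event["item_id"])


bus.subscribe("annotation.completed", flag_low_quality)

bus.publish("annotation.completed", {"job_id": "job-1", "item_id": "a", "quality": 0.95})
bus.publish("annotation.completed", {"job_id": "job-1", "item_id": "b", "quality": 0.40})
```

Because each service only subscribes to topics, a new workflow type is added by registering another handler. Neither the publisher nor the existing handlers change.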

Zero-downtime migration. A data synchronization approach fed records from the existing database into new services during cutover. Annotation teams kept working. No jobs were paused.
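The source does not detail the synchronization mechanism. One common pattern, assumed here purely for illustration, is to copy a snapshot of the existing database into the new store and then replay any writes captured while the copy was running. All names and records below are hypothetical.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]
Change = Tuple[str, str, Record]  # (operation, key, record)


def backfill(snapshot: Dict[str, Record], target: Dict[str, Record]) -> None:
    """Copy a point-in-time snapshot of the legacy data into the new store."""
    for key, record in snapshot.items():
        target[key] = dict(record)


def apply_changes(changes: List[Change], target: Dict[str, Record]) -> None:
    """Replay writes captured during the backfill, in order."""
    for operation, key, record in changes:
        if operation == "upsert":
            target[key] = dict(record)
        elif operation == "delete":
            target.pop(key, None)


# Legacy store at the moment the snapshot is taken.
legacy: Dict[str, Record] = {
    "job-1": {"status": "running"},
    "job-2": {"status": "queued"},
}
snapshot = {key: dict(record) for key, record in legacy.items()}

# Annotation teams keep working: writes land in the legacy store and are
# also captured in a change log while the backfill is in progress.
captured: List[Change] = []
legacy["job-2"] = {"status": "running"}
captured.append(("upsert", "job-2", {"status": "running"}))
legacy["job-3"] = {"status": "queued"}
captured.append(("upsert", "job-3", {"status": "queued"}))

new_store: Dict[str, Record] = {}
backfill(snapshot, new_store)
apply_changes(captured, new_store)

stores_match = new_store == legacy
```

Once the new store matches the legacy one, traffic can be switched over without pausing jobs.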

`04` The Results

From 30-minute report cycles to sub-minute. From bottlenecked pipelines to 20x throughput.

The platform changed how Appen’s annotation operations run day to day. Data teams see quality metrics and job status as events happen. Business units commit to client SLAs with real visibility into pipeline capacity.

30 min → under 1 min

Per-report generation time

30x faster than the legacy baseline

Processing speed across the full annotation pipeline increased 5x. Data annotation job throughput climbed 20x. The platform handles 1,000 parallel workflows with linear scalability: adding volume does not degrade performance for existing jobs.

Appen’s revenue depends on delivering training datasets faster than competitors. The operational gains map directly to client retention and deal velocity. Annotation teams catch quality issues in seconds, not after thousands of mislabeled items have shipped downstream.

`05` What’s Next

Scaling into multimodal annotation and model-assisted labeling

The platform positions Appen to move into higher-margin AI data services, such as model-assisted pre-labeling, where AI handles first-pass annotation and humans correct the output. Provectus continues to partner with Appen on extending the platform as new modalities and workloads arrive.
