Appen cuts report generation from 30 minutes to under one minute and increases data job throughput 20x, enabling faster delivery of training datasets to enterprise AI teams.
Client profile
A global AI training data company serving enterprise ML teams
Industry
Other, AI & Data Services
Region
Global
Faster analytics and data reporting
Increase in data annotation jobs handled per hour
Appen supplies the training data that enterprise AI teams depend on to build production models. The company manages annotation workflows across text, image, audio, and video for clients training at scale.
01 The Challenge
The AI data services market is projected to reach $28.5B by 2034, growing at 28.6% CAGR. In that market, processing and delivery speed is the differentiator. Annotation teams need quality metrics, throughput rates, and job status in near real time. Without that visibility, errors accumulate before anyone notices. Business units need pipeline data to commit to client SLAs. When reporting lags, decisions lag with it.
Appen’s existing analytics platform generated reports in 30-minute cycles. The architecture was built in the company’s early days. It handled moderate volumes. It was not designed for 1,000 parallel workflows or real-time dashboards serving annotation teams across time zones.
Data teams waited half an hour to see whether a job was running correctly. By the time a quality issue surfaced, thousands of items might already have been annotated against a flawed configuration.
The business case was specific:
In 2020, Appen integrated Figure Eight (a human-in-the-loop platform for data transformation), adding new data streams and annotation modalities. The volume pressure intensified. Appen partnered with Provectus to redesign the analytics platform from the ground up.
02 The Approach
Provectus began with discovery workshops to assess Appen’s existing operations, infrastructure, and data flows. The team mapped where latency accumulated and found three sources: batch report generation, monolithic workflow orchestration, and a storage layer not built for concurrent reads.
The assessment confirmed what Appen’s leadership suspected. The legacy architecture could no longer support real-time analytics or parallel processing at the volumes the business now required.
From the assessment, Provectus defined the target. A cloud-native data platform on AWS. Event-driven pipelines. A data lake for flexible querying. Microservices replacing the monolithic workflow engine. The design principle was linear scalability. Adding annotation jobs should not degrade performance for existing ones.
Before full production rollout, Provectus ran load tests proving linear scalability against Appen’s projected growth targets. The architecture had to pass before migration began.
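One way to make "linear scalability" testable is to check that throughput per workflow stays roughly constant as load grows. The sketch below is illustrative only: the function name, the 10% tolerance, and the sample numbers are assumptions, not Provectus's actual test harness or Appen's measurements.

```python
def is_linear(samples, tolerance=0.10):
    """Return True if per-workflow throughput at every load level stays
    within `tolerance` (relative) of the value at the smallest load.

    `samples` is a list of (parallel_workflows, total_throughput) pairs.
    """
    samples = sorted(samples)
    base_load, base_throughput = samples[0]
    baseline = base_throughput / base_load
    for load, throughput in samples:
        per_workflow = throughput / load
        if abs(per_workflow - baseline) / baseline > tolerance:
            return False
    return True


# Made-up load-test readings for illustration (not real measurements).
linear_run = [(100, 1000.0), (500, 4900.0), (1000, 9700.0)]
degraded_run = [(100, 1000.0), (500, 3500.0), (1000, 5000.0)]

print(is_linear(linear_run))    # per-workflow rate barely moves
print(is_linear(degraded_run))  # per-workflow rate drops as load grows
```

A gate like this can run at each projected growth target, so a failing load level blocks migration rather than surfacing in production.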
03 The Build
The engineering work covered three layers.
Data pipelines. Provectus built event-streaming pipelines using managed container orchestration and cloud storage on AWS. Appen’s existing data streams and annotation workloads migrated to these pipelines after testing and optimization. Data flows in real time from ingestion through processing to reporting.
Data lake and analytics. A cloud-native data lake replaced the legacy storage and query layer. Annotation teams and business units run on-demand queries for report generation. No more waiting for batch cycles. Dashboards reflect pipeline state as it happens.
Microservices and event backbone. Provectus decomposed the monolithic data workflows into dedicated microservices. Each handles a segment of the annotation pipeline in real time. An event-streaming layer serves as the communication backbone between services and databases. This pattern ensures that adding a new workflow type does not require changes to existing services.
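The decoupling described above can be sketched with a small in-memory publish/subscribe bus. This is a simplified stand-in for a real event-streaming layer; the `EventBus` class, the topic name, and the service functions are hypothetical names chosen for illustration.

```python
from collections import defaultdict


class EventBus:
    """Minimal in-memory stand-in for an event-streaming backbone."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Each subscriber receives the event independently of the others.
        for handler in self._subscribers[topic]:
            handler(event)


bus = EventBus()

# Existing service: records every completed annotation for quality checks.
quality_log = []

def quality_service(event):
    quality_log.append((event["job_id"], event["label"]))

bus.subscribe("annotation.completed", quality_service)

# New workflow type: added as one more subscriber, with no edits to
# quality_service or to the publisher.
audio_log = []

def audio_review_service(event):
    if event["modality"] == "audio":
        audio_log.append(event["job_id"])

bus.subscribe("annotation.completed", audio_review_service)

bus.publish("annotation.completed",
            {"job_id": "job-1", "modality": "audio", "label": "speech"})
bus.publish("annotation.completed",
            {"job_id": "job-2", "modality": "image", "label": "cat"})
```

Because publishers only know the topic, not the consumers, a new service subscribes to the existing stream instead of being wired into existing code.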
Zero-downtime migration. A data synchronization approach fed records from the existing database into new services during cutover. Annotation teams kept working. No jobs were paused.
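The source does not describe the synchronization mechanism in detail. One common pattern is cursor-based incremental sync with idempotent upserts, sketched below under that assumption. The `version` field, the `sync_batch` function, and the dictionary stores are hypothetical, not the actual Appen schema.

```python
def sync_batch(legacy_rows, target, cursor):
    """Copy rows whose version is newer than `cursor` into `target`.

    Writes are upserts keyed by row id, so re-running a batch is harmless.
    Returns the new cursor (the highest version copied so far).
    """
    new_cursor = cursor
    for row in sorted(legacy_rows, key=lambda r: r["version"]):
        if row["version"] > cursor:
            target[row["id"]] = dict(row)
            new_cursor = row["version"]
    return new_cursor


# Legacy database contents at the start of cutover (illustrative rows).
legacy = [
    {"id": "a", "version": 1, "status": "queued"},
    {"id": "b", "version": 2, "status": "queued"},
]
new_store = {}

cursor = sync_batch(legacy, new_store, cursor=0)

# The legacy system keeps accepting writes while the migration runs.
legacy[0] = {"id": "a", "version": 3, "status": "annotated"}

# The next pass picks up only the change made since the last cursor.
cursor = sync_batch(legacy, new_store, cursor)
```

Repeating the pass until the cursor stops advancing lets the new services catch up while the old database stays live, which is what allows work to continue during cutover.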
04 The Results
The platform changed how Appen’s annotation operations run day to day. Data teams see quality metrics and job status as events happen. Business units commit to client SLAs with real visibility into pipeline capacity.
30 min → under 1 min
Per-report generation time
30x faster than the legacy baseline
Processing speed across the full annotation pipeline increased 5x. Data annotation job throughput climbed 20x. The platform handles 1,000 parallel workflows with linear scalability: adding volume does not degrade performance for existing jobs.
Appen’s revenue depends on delivering training datasets faster than competitors. The operational gains map directly to client retention and deal velocity. Annotation teams catch quality issues in seconds, not after thousands of mislabeled items have shipped downstream.
05 What’s Next
The platform positions Appen to move into higher-margin AI data services such as model-assisted pre-labeling, where AI handles first-pass annotation and humans correct the output. Provectus continues to partner with Appen on extending the platform as new modalities and workloads arrive.