Whitepaper

Synthetic Invoice Dataset Generator

A synthetic dataset is generated by a program rather than collected from real-world sources. Such datasets aim to be flexible and rich enough to support research with machine learning models.

Building an ML model that recognizes and extracts data from invoices requires a sufficiently large annotated dataset for training. However, because invoices contain sensitive information, such datasets are not publicly available. A document generator that produces data similar in format and content to real-life invoices can be a solution.

This paper presents a method for creating a synthetic dataset of invoices. Each document has a unique layout and design, and the names of key fields vary from document to document, so that the generated documents approximate real invoices.
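The idea of varying layouts and key-field names can be illustrated with a minimal Python sketch. It is not the generator described in the whitepaper: the label variants, layout names, value formats, and the functions `generate_invoice` and `generate_dataset` are all hypothetical, and the rendering step is reduced to picking a layout name.

```python
import random

# Hypothetical label variants for each key field (not taken from the whitepaper).
LABEL_VARIANTS = {
    "invoice_number": ["Invoice No.", "Invoice #", "Invoice Number"],
    "invoice_date": ["Date", "Invoice Date", "Issued On"],
    "total": ["Total", "Amount Due", "Total Due"],
}

# Hypothetical layout names standing in for a real rendering step.
LAYOUTS = ["header_left", "header_right", "two_column"]


def generate_invoice(rng):
    """Return one synthetic invoice record with ground-truth annotations."""
    # Values are random placeholders; the year is fixed arbitrarily.
    values = {
        "invoice_number": f"INV-{rng.randint(0, 99999):05d}",
        "invoice_date": f"2021-{rng.randint(1, 12):02d}-{rng.randint(1, 28):02d}",
        "total": f"{rng.randint(100, 999999) / 100:.2f}",
    }
    fields = []
    for key, value in values.items():
        # The printed label varies, while the canonical key stays fixed
        # and serves as the annotation for training.
        label = rng.choice(LABEL_VARIANTS[key])
        fields.append({"key": key, "label": label, "value": value})
    return {"layout": rng.choice(LAYOUTS), "fields": fields}


def generate_dataset(n, seed=0):
    """Generate n synthetic invoices reproducibly from a seed."""
    rng = random.Random(seed)
    return [generate_invoice(rng) for _ in range(n)]
```

Because each record keeps the canonical key next to the randomized label, the annotation comes for free from the generation process, and no real invoice data is involved at any point.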

As a result of this work, we created a privacy-preserving artificial dataset that can serve as training data for machine learning algorithms. The annotated sample invoices it contains are adequate for training models to extract the necessary information.

