Provenance of scientific data is a key piece of the metadata record for the data's ongoing discovery and reuse. Provenance collection systems capture provenance on the fly. However, the protocol between the application and the provenance tool may not be reliable. Consequently, the provenance record can be partial, partitioned, or simply inaccurate.
The Gigabyte Synthetic Database is a noisy data collection generated using the Workflow Emulator Tool (WORKEM) from a number of scientific workflow examples that include modeled failures.