Last year I had the honor of partnering with Phillip Hale to work on a synthetic data whitepaper and am thrilled to share our results. I was responsible for all data generation. I iterated with Phillip to test variables and refine the data generation pipeline.
I have found that iteration is often undervalued when it comes to synthetic data. It is often much more effective to take a small team and a simpler approach to data generation and iterate daily than to build a complex pipeline with a large team and one-month turnarounds.