This document lists the ways you can improve the accuracy of your model.
When it comes to improvement, users generally have two levers for improving their models:
- Data (Images/PDFs)
The methods below are listed in order of importance.
Quantity of Data
- More data: The most fundamental and effective way to improve the accuracy of your model is to add more data. As a rule, more is better.
- Dataset Diversity & Consistency: Train the model on the kind of images you expect it to work on. If you expect the model to handle a wide variety of documents, your training data should represent that variety.
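One way to check whether your training data represents what the model will see is to compare the mix of document types in both sets. The following is a minimal sketch, not part of the product: the function names, the document type names, and the 10% tolerance are all assumptions made for illustration, and it presumes you have tagged each file with a document type yourself.

```python
from collections import Counter


def type_shares(doc_types):
    """Return each document type's share of the list, as a fraction of 1."""
    counts = Counter(doc_types)
    total = len(doc_types)
    return {doc_type: count / total for doc_type, count in counts.items()}


def underrepresented(training_types, expected_types, tolerance=0.10):
    """List types whose training share trails the expected share by more than `tolerance`."""
    train = type_shares(training_types)
    expected = type_shares(expected_types)
    return sorted(
        doc_type
        for doc_type, share in expected.items()
        if share - train.get(doc_type, 0.0) > tolerance
    )


# Hypothetical data: one type tag per document.
training = ["printed_receipt"] * 90 + ["handwritten_receipt"] * 10
expected = ["printed_receipt"] * 60 + ["handwritten_receipt"] * 40

print(underrepresented(training, expected))
```

Any type this check reports is a good candidate for collecting and annotating more examples.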
Quality of Data
- Readable: The document should be readable by our OCR. When you draw a bounding box on the image, the auto-populated text should ideally match the text in the image. A few incorrect characters are okay; you can correct them while annotating.
- Blurry Images: If the images are blurry and the text can't be read, auto-populating the text might not be very effective. It is best to avoid blurry images.
- Consistency: It is important to follow the same convention when annotating data. For example, if you've annotated the date and time in a receipt under the label `date`, follow the same practice in all receipts.
- Completeness: Make sure you annotate all your images. If the number of images is very large, make sure the images you have annotated are complete. Do not partially annotate images. For example, do not annotate only 5 out of 10 labels in an image.
- Only annotate text you want: If you want to extract, say, the invoice number, annotate only the invoice number itself. The model will learn to use the surrounding text to locate it. An example is given below.
- Multiline fields: The model can also learn multiline fields, such as addresses. As shown in the image below, you can annotate the entire address field.
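The consistency and completeness guidelines above can be spot-checked before training. The sketch below assumes you can export your annotations as a mapping from image name to the list of label names used on that image. That format, the function name `audit_annotations`, and the sample file names are hypothetical and do not describe the product's actual export format.

```python
def audit_annotations(annotations, expected_labels):
    """Report images with missing labels or label names outside the expected set.

    `annotations` maps an image name to the label names used on it
    (an assumed format, chosen for illustration).
    """
    expected = set(expected_labels)
    report = {}
    for image_name, labels in annotations.items():
        used = set(labels)
        missing = sorted(expected - used)      # completeness: labels never annotated
        unexpected = sorted(used - expected)   # consistency: off-convention label names
        if missing or unexpected:
            report[image_name] = {"missing": missing, "unexpected": unexpected}
    return report


# Hypothetical annotations for three receipts.
annotations = {
    "receipt_01.jpg": ["date", "total", "merchant"],
    "receipt_02.jpg": ["Date", "total", "merchant"],  # inconsistent label name
    "receipt_03.jpg": ["date"],                       # partially annotated
}
expected_labels = ["date", "total", "merchant"]

for image_name, problems in audit_annotations(annotations, expected_labels).items():
    print(image_name, problems)
```

A label reported as missing is not always a mistake, since the field may simply not appear in that image. Treat the report as a list of images to review, not as a list of errors.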