Blaming data is all we have
Data-centric approaches are often giving more fruitful results than model-centric approaches, and industry is adopting the data-centric approach more and more often.
In the world of pre-trained models and transfer learning, everyone wants to achieve high accuracy, and performance is expected to always be good.
But what if you did everything right in terms of frameworks, best practices, and architectures, and you still end up with lower performance than you expected?
There is a saying that if you feed garbage to a strong network, you will ultimately end up with garbage as output.
Garbage IN Garbage OUT.
Clean data with a not-so-good model is usually better than noisy data with the best model.
There are two ways of moving towards a data-centric approach:
Cleaning the noise in the data
Labeling the data against a ground truth that should be correct. With this you are computing the ceiling of your model, which is called human-level performance
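To make the idea of a ceiling concrete, here is a minimal sketch that compares one human annotator and one model against the same agreed ground truth. The labels, the variable names, and the accuracy helper are all made up for illustration, and plain accuracy is only one possible way to measure human-level performance.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], ground_truth: Sequence[str]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not ground_truth:
        raise ValueError("need at least one labeled sample")
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return correct / len(ground_truth)


# Hypothetical toy data: the agreed ground truth, one human annotator, one model.
ground_truth = ["cat", "dog", "dog", "cat", "bird"]
human_labels = ["cat", "dog", "dog", "cat", "cat"]
model_labels = ["cat", "dog", "cat", "cat", "cat"]

human_level = accuracy(human_labels, ground_truth)  # the ceiling we compare against
model_level = accuracy(model_labels, ground_truth)
gap = human_level - model_level  # how far the model is from the ceiling

print(f"human-level: {human_level:.2f}, model: {model_level:.2f}, gap: {gap:.2f}")
```

A large gap suggests the model still has room to improve on this data, while a small gap suggests the labels themselves may be the limiting factor.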
Cleaning data is something most of us would rather avoid, but we should not. If you clean your data you will know your data, and if you know your data you will be able to draw conclusions.
Features are subjective, but you will also get to know which features the model is not learning. A human-level performance baseline is always required.
Some best practices while cleaning/labeling data:
Try to label/clean fewer samples but do it efficiently
Try to take guidance and views from the team
The same samples can be labeled by more than two team members; then take the majority label as the ground truth
Start the research and development loop (described below) as soon as possible
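The majority-vote practice above can be sketched in a few lines. This is only an illustration: the sample ids and labels are invented, and returning None when no label has a strict majority (so that the sample can be sent back for review) is an assumption of this sketch, not a rule from the practices listed here.

```python
from collections import Counter
from typing import Dict, Hashable, List, Optional, Sequence


def majority_label(votes: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the label picked by more than half of the annotators, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    # Assumption: require a strict majority; ties and splits return None.
    return label if count > len(votes) / 2 else None


# Hypothetical annotations: each sample was labeled by three team members.
annotations: Dict[str, List[str]] = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "bird", "cat"],
    "img_003": ["dog", "dog", "dog"],
}

ground_truth = {sample: majority_label(votes) for sample, votes in annotations.items()}
needs_review = [sample for sample, label in ground_truth.items() if label is None]

print(ground_truth)
print("needs review:", needs_review)
```

Samples that end up in needs_review are good candidates for a team discussion, which fits the advice to take guidance and views from the team.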
What I mean by the research and development loop is the following steps:
Step-1 Collecting samples
Step-2 Training model
Step-3 Evaluating model
Step-4 Repeat from Step-1
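The four steps above can be written as a small loop. This is a rough sketch, not a real training pipeline: the three callables are placeholders you would replace with your own code, and the target score and the maximum number of rounds are assumptions added here so that the loop has a stopping point.

```python
from typing import Any, Callable, List


def research_and_development_loop(
    collect_samples: Callable[[int], List[Any]],
    train_model: Callable[[List[Any]], Any],
    evaluate_model: Callable[[Any], float],
    target_score: float,
    max_rounds: int,
) -> List[float]:
    """Run collect -> train -> evaluate rounds and return the score of each round."""
    dataset: List[Any] = []
    history: List[float] = []
    for round_number in range(1, max_rounds + 1):
        dataset.extend(collect_samples(round_number))  # Step-1: collect samples
        model = train_model(dataset)                   # Step-2: train the model
        score = evaluate_model(model)                  # Step-3: evaluate the model
        history.append(score)
        if score >= target_score:                      # assumed stopping rule
            break
        # Step-4: otherwise repeat from Step-1 with more samples
    return history


# Toy stand-ins (hypothetical): the "model" is just the dataset size,
# and the score grows with the amount of collected data.
def toy_collect(round_number: int) -> List[int]:
    return [round_number] * 10


def toy_train(dataset: List[int]) -> int:
    return len(dataset)


def toy_evaluate(model: int) -> float:
    return min(1.0, model / 40)


history = research_and_development_loop(
    toy_collect, toy_train, toy_evaluate, target_score=0.7, max_rounds=5
)
print(history)
```

Starting this loop early, even with a small and imperfect dataset, shows quickly whether collecting or cleaning more samples actually moves the evaluation score.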
I know cleaning and labeling data is sometimes boring, but remember that, step by step, you are building a model that is way better and could be the next big thing.
I myself, working on a live machine learning project, have cleaned a lot of data. The only motivation I had was the faith that I could draw conclusions from cleaner data: I learn the features and can tell, in a logical manner, exactly what problem the model is facing.