MIT just deleted a huge dataset used for training AI because it contained image labels with racist and misogynistic slurs. This type of dataset is used to “teach” other software what words to use when labeling photos.
Racism and misogyny are built into our society. Software is created by humans, so racism and misogyny are frequently built into the software we create. It may not be intentional, but intent doesn’t matter when you look at the results.
How much software has been “taught” by this dataset to label photos of Black people with the n-word? (Yes, it’s in there.)
The dataset had been in use since 2008. It was built by a university, so people assumed they could trust it. They were wrong. If you are creating software, you can’t just assume that any third-party data, libraries, or other dependencies are okay, no matter how reputable the source.