CrowdFlower’s Senior Data Scientist Nick Gaylord delivered his presentation, “Machine learning with humans in the loop”, at the AI Summit last week in New York, giving the audience an interesting view of how the company applies AI.
Gaylord opened by comparing structured and unstructured data. Structured data is system-generated and purposefully designed, such as system logs and transaction records, he explained. Its advantages are known contents that are easy to query; its challenge is algorithmic optimisation, since the available information can be limited.
Unstructured data, by contrast, exists “in the wild” and was not created for automation, Gaylord explained, listing examples such as images and audio. Its advantages are that it is abundant and natural, and that it enables discovery; its challenges are that framing effective questions can be difficult, since unlocking the information is often hard.
Gaylord then broke down what exactly makes unstructured data challenging, saying that the most common answer is cleaning and selecting data to make it useful. Showing an infographic of how data scientists spend their time, he explained that the most time-consuming task is undoubtedly cleaning and organising data, accounting for a full 60%.
Other tasks include collecting data sets (19%), mining data for patterns (9%), refining algorithms (4%), building training sets (3%) and other work (5%). However, questions about unstructured data often require informed human judgment too, Gaylord explained, which also contributes to the challenge. The key is to combine human intelligence with machine learning.
Supporting this, he showed a case study of the US Postal Service. Before the 1950s, all mail sorting was done by hand. In 1959 automatic reading of typewritten zip codes began, and in 1972 sorting for more efficient delivery was introduced. Handwriting recognition arrived in the 1990s, and today humans are still on hand to help, showing that cooperation between humans and machines remains very much present.
Gaylord also emphasised “the power of uncertainty”, explaining that human labelling does more than offset low-confidence model predictions. Low-confidence predictions are less likely to be accurate, and human labels on these items help ensure quality across the entire dataset. This also provides a near-ideal set of additional training data, Gaylord said, strengthening the model at just the right points, every time it’s used.
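The loop Gaylord describes can be sketched in a few lines. This is a minimal illustration, not CrowdFlower’s actual system: it assumes a scikit-learn-style probabilistic classifier, synthetic data standing in for a labelled seed set and an unlabelled pool, and an arbitrary confidence threshold of 0.75 for routing items to human annotators.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical data: a small labelled seed set plus an unlabelled pool.
# y_pool stands in for the human annotators' answers.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_seed, y_seed = X[:200], y[:200]
X_pool, y_pool = X[200:], y[200:]

# Train an initial model on the seed labels.
model = LogisticRegression().fit(X_seed, y_seed)

# Confidence = probability of the predicted class for each pool item.
confidence = model.predict_proba(X_pool).max(axis=1)

# Route the least confident items to human labellers...
threshold = 0.75  # assumed cutoff, chosen for illustration
needs_review = confidence < threshold

# ...then fold the human labels back in as new training data,
# strengthening the model exactly where it was uncertain.
X_new = np.vstack([X_seed, X_pool[needs_review]])
y_new = np.concatenate([y_seed, y_pool[needs_review]])
model = LogisticRegression().fit(X_new, y_new)
```

In practice this loop repeats: each pass of predictions surfaces a fresh batch of uncertain items for review, which is what makes the human labels "near-ideal" additional training data.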