Scientists have developed a new artificial intelligence system that can more effectively extract data from the vast wealth of information present on the internet.
The data necessary to answer myriad questions – about, say, the correlations between the industrial use of certain chemicals and the incidence of disease, or between patterns of news coverage and voter-poll results – may all be online in the form of plain text. However, extracting that data from plain text and organising it for quantitative analysis can be prohibitively time-consuming.
Researchers from the Massachusetts Institute of Technology (MIT) in the US developed a new approach to information extraction. Most machine-learning systems work by combing through training examples and looking for patterns that correspond to classifications provided by human annotators. For instance, humans might label parts of speech in a set of texts, and the machine-learning system will try to identify patterns that resolve ambiguities – such as when “her” is a direct object and when it is a possessive adjective. Typically, computer scientists will try to feed their machine-learning systems as much training data as possible, which generally increases the chances that a system will be able to handle difficult problems. In the new research, by contrast, the scientists trained their system on scanty data.
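As a toy illustration of the kind of supervised learning described above – not MIT's actual system – the sketch below "trains" on human-labeled examples of the word "her" and memorises which label is most common for each following word. All the example sentences, features, and function names here are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical training data: (word before "her", word after "her", label
# supplied by a human annotator). In "I read her book", "her" is a
# possessive adjective; in "I saw her yesterday", it is a direct object.
TRAIN = [
    ("saw", "yesterday", "object"),
    ("called", "twice", "object"),
    ("met", "at", "object"),
    ("read", "book", "adjective"),
    ("took", "advice", "adjective"),
    ("liked", "car", "adjective"),
]

def train(examples):
    """Count how often each label appears for each following word."""
    counts = defaultdict(Counter)
    for prev_word, next_word, label in examples:
        counts[next_word][label] += 1
    return counts

def predict(model, next_word, default="object"):
    """Return the most frequent label seen for this feature."""
    if next_word in model:
        return model[next_word].most_common(1)[0][0]
    return default

model = train(TRAIN)
print(predict(model, "book"))       # "adjective" – as in "her book"
print(predict(model, "yesterday"))  # "object" – as in "saw her yesterday"
```

A real part-of-speech tagger would use far richer features and far more data; the point is only that the patterns come from annotator-provided labels, not hand-written rules.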
Every decision the system makes is the result of machine learning. The system learns how to generate search queries, gauge the likelihood that a new text is relevant to its extraction task, and determine the best strategy for fusing the results of multiple attempts at extraction. The researchers compared their system’s performance to that of several extractors trained using more conventional machine-learning techniques. For every data item extracted in both tasks, the new system outperformed its predecessors, usually by about 10 per cent.
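The fusing step the article mentions can be pictured as a confidence-weighted vote over candidate extractions. The sketch below is a guess at one simple way to do that, not a description of MIT's method; the `fuse` function and the example confidences are invented for illustration.

```python
from collections import defaultdict

def fuse(candidates):
    """Fuse multiple (value, confidence) extractions by weighted vote.

    Each attempt at extraction yields a candidate value and a score for
    how likely the source text was relevant; the value with the highest
    total weight across all attempts wins.
    """
    scores = defaultdict(float)
    for value, confidence in candidates:
        scores[value] += confidence
    return max(scores, key=scores.get)

# Example: three retrieved documents yield conflicting values.
candidates = [("42", 0.9), ("17", 0.3), ("42", 0.6)]
print(fuse(candidates))  # "42" – total weight 1.5 beats 0.3
```

In the system the article describes, even the weighting strategy itself would be learned rather than fixed as it is here.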