In March of 2016, the number of web pages reached 1 billion. The volume of knowledge surrounding us keeps growing faster than any past projection. Yet, the rate at which we absorb information hasn’t changed. Can machines help us with this information overload?
By Dominika Tkaczyk
Crucial for our very existence, information processing plays a significant part in our everyday lives. Each second, huge amounts of information of a varying nature are exchanged in order to spread ideas and propagate knowledge, improve our quality of life, warn about danger, and serve many other purposes. Good-quality information, used at the right moment and in the right way, can make all the difference – both in our personal and professional lives.
Some of the greatest achievements and inventions grew out of the need to facilitate the process of information dissemination. The emergence of spoken language allowed for fast communication, including exchanging abstract ideas. The invention of writing made it possible to spread information between people who never met in person, or didn’t even live at the same time. In turn, printing enabled easy reproduction of written text on a massive scale.
Each of these breakthroughs took information exchange to the next level by overcoming certain natural limitations, in turn greatly affecting the volume of available information. For example, before the invention of the printing press, the number of copies of a book was naturally limited by the time needed to manually rewrite it. However, all these leaps in volume pale in comparison to the results of the digital revolution we have witnessed within the last few decades.
Among other changes in almost every aspect of our lives, the digital revolution resulted in a major shift in the way information is exchanged. A large portion of communication has moved to electronic media, making it much easier to produce and store all kinds of content. Today, we can easily use a small device to carry thousands of electronic books everywhere, we have access to digital libraries with documents on every possible topic without even having to go out, and publishing new information is as easy as a mouse click. All this has resulted in yet another major increase in the volume of available information, on a scale never before observed, or even imagined.
Curse of abundance
According to Google’s estimates, nearly 130 million books have been published in all of modern history, with over 2 million new titles published every year. The number of existing web pages is estimated to have reached 1 billion in March of 2016, which is three times as many as five years ago, and 13 times as many as 10 years ago, according to InternetLiveStats.com. Over 220 million new pages appeared within the past year alone. A printed copy of the entire internet would use roughly 136 billion sheets of paper, which would require 16 million trees to be cut down, according to an estimate made by George Harwood and Evangeline Walker, students at the University of Leicester in the UK. Stacked one on top of another, the sheets would form a column of paper over 13 km high. Every day, internet users publish four million blog posts and 600 million tweets.
At the same time, our ability to consume information has not changed much, and it has become a significant bottleneck in the information dissemination process. It still takes us roughly the same time to read and understand a book, a story, or even a short note, and that time is far too long in comparison to the growth rate of the information surrounding us.
For example, we would have to be able to read 46 blog posts per second to keep up with all the data being produced. And reducing the information only to topics we find interesting is not good enough anymore. A search for “global warming,” for instance, results in over 54 million documents (this includes web pages, images, books, videos and other document types). Even if we were able to understand a single document in one second, we would need 20 months of nonstop work to read and watch everything. And what about other interesting topics? Life suddenly seems extremely short.
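The arithmetic behind these two figures is easy to check. The short sketch below uses only the numbers quoted in the text; the average month length of 30.44 days is an assumption added for the conversion.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures quoted in the article.
blog_posts_per_day = 4_000_000
search_results = 54_000_000  # documents returned for "global warming"

# Reading rate needed to keep up with newly published blog posts.
posts_per_second = blog_posts_per_day / SECONDS_PER_DAY

# Time needed to go through every search result at one document per second.
days_needed = search_results / SECONDS_PER_DAY
months_needed = days_needed / 30.44  # assumption: average month length in days

print(f"{posts_per_second:.1f} blog posts per second")
print(f"{days_needed:.0f} days, or about {months_needed:.1f} months")
```

Both results agree with the rounded figures above: a little over 46 posts per second, and somewhat more than 20 months of nonstop reading.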
Computers to the rescue
In cases of scale-related emergencies we typically turn to machines for help, and the problem of information overload is no exception. Luckily, we have already invented sophisticated algorithms for assisting people in consuming information. How can this be done? The top two approaches are information retrieval and information extraction.
In general, information retrieval aims at limiting the number of documents a person needs to read by intelligent filtering, and as such is the very task solved by modern search engines. The idea is as simple as employing machines to select relevant documents from a large document collection in response to a free text query entered by the search engine user, in other words – googling. Ideally, the query is an exact specification of the user’s information need and selected documents indeed contain the relevant answer.
Unfortunately, the filtering model is not sufficient anymore, as often the number of relevant documents is still too high to be useful for a human. The key functionality that search engines provide nowadays is in fact sorting the selected documents by their levels of relevance so that the most relevant documents are presented at the top of the resulting list. The sorting is typically based on the content of the documents (a document containing a large number of the input query words will be more relevant), the network of links between the web pages (a web page with a lot of incoming links is more likely to be relevant), and also the user’s personal preferences, which especially help when the query is ambiguous: is a user who typed “python” into a search engine interested in zoology, buying a new pet, or learning a programming language?
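To make the idea of relevance-based sorting concrete, here is a minimal sketch that combines two of the signals mentioned above: the number of query words in a document and the number of incoming links. It is a toy illustration, not how any real search engine works. The pages, the link counts, the `link_weight` parameter and the scoring formula are all made up for the example, and personal preferences are left out entirely.

```python
import math
from collections import Counter

# Toy collection: page name -> (text, number of incoming links). All made up.
PAGES = {
    "snakes": ("the python is a large snake and the python is not venomous", 12),
    "coding": ("learn python programming with this python tutorial", 250),
    "pets": ("buying a pet snake what to know before you buy", 30),
}

def content_score(query, text):
    """Count how many times the query words occur in the text."""
    counts = Counter(text.lower().split())
    return sum(counts[word] for word in query.lower().split())

def rank(query, pages, link_weight=0.5):
    """Return the names of matching pages, most relevant first."""
    scored = []
    for name, (text, incoming_links) in pages.items():
        matches = content_score(query, text)
        if matches == 0:
            continue  # filtering step: drop pages without any query word
        # Damp the link count with a logarithm so that heavily linked
        # pages do not completely dominate the content signal.
        score = matches + link_weight * math.log1p(incoming_links)
        scored.append((score, name))
    return [name for score, name in sorted(scored, reverse=True)]

print(rank("python", PAGES))
```

With this toy data, both the zoology page and the programming page mention “python” twice, so the link count decides the order. This also shows why a third signal, such as the user’s preferences, is needed to resolve the ambiguity.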
Sorting the results allows the users to focus on a small number of the most relevant documents. The quality of relevance-based sorting can be indirectly assessed by examining a search engine’s click-through statistics, which show how often the documents at various positions in the result list are clicked. In Google’s search engine, about one-third of all the clicks are on the first search result, and 92 percent of them are related to the first 10 results, or the first result page. The second result page rarely gets the user’s attention. As one popular meme says: “The best place to hide a dead body is on page two of Google search results.”
Less is more
Information retrieval aims at limiting the fraction of the collection presented to the user, while keeping the documents themselves intact. Extraction techniques adopt a different approach, trying to downsize the information volume by manipulating the content of the documents.
One example of this approach is automatic summarization, which selects a subset of sentences and phrases that captures the essence of the original text precisely, while keeping the selected set small. This is done by examining various aspects of the sentences, such as: their position in the document (first and last few sentences of the document might be more important than others), their length (simple sentences are usually preferred over complex ones), the presence of key words related to the main topic, or specific phrases (such as “this document describes”).
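The sentence features listed above can be turned into a very small extractive summarizer. The sketch below is only an illustration under simplifying assumptions: the weights, the 20-word length threshold, the naive sentence splitter and the sample text are all invented for the example, and real summarization systems are considerably more sophisticated.

```python
import re

CUE_PHRASES = ("this document describes",)  # example phrase from the text

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def score_sentence(sentence, index, total, keywords):
    """Score one sentence using position, length, keywords and cue phrases."""
    words = re.findall(r"[a-z']+", sentence.lower())
    score = 0.0
    if index == 0 or index == total - 1:
        score += 1.0  # position: first and last sentences get a bonus
    if len(words) <= 20:
        score += 0.5  # length: short sentences are preferred
    score += sum(1.0 for w in set(words) if w in keywords)  # topic keywords
    if any(phrase in sentence.lower() for phrase in CUE_PHRASES):
        score += 2.0  # cue phrase that announces the topic
    return score

def summarize(text, keywords, max_sentences=2):
    """Pick the best-scoring sentences and return them in original order."""
    sentences = split_sentences(text)
    scored = [
        (score_sentence(s, i, len(sentences), keywords), i)
        for i, s in enumerate(sentences)
    ]
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:max_sentences]
    return " ".join(sentences[i] for i in sorted(i for _, i in top))

sample = (
    "This document describes how printing changed publishing. "
    "Before the press, scribes copied every book by hand, which took months "
    "and made each copy expensive and rare for most readers across the whole "
    "of Europe. The weather was mild that year. "
    "Printing made books cheap and common."
)
summary = summarize(sample, keywords={"printing", "books", "press"})
print(summary)
```

Here the summarizer keeps the opening and closing sentences and drops the long middle sentence and the off-topic one.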
Such a short, automatically compiled summary is usually not enough to understand every aspect of the document, but lets the user quickly decide whether the document is worth further examination, thus saving time spent on reading useless or irrelevant content.
Automatic summarization techniques can also be observed in action in Google’s search engine. When a user searches for a certain topic, intelligent algorithms extract a short fragment from one particular source and this summary is presented above the result list. In some cases, the user does not even need to click on any specific web page to obtain the information they seek.
Other examples of automatic information extraction techniques include answering questions posed by people in natural language, sentiment analysis (aimed at detecting subjective traits in the text in order to determine the writer’s opinion about something) and extracting mentions of specific facts from text written in natural language (such as “Bill Gates founded Microsoft,” or “Microsoft is based in Redmond”).
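For the last of these tasks, a deliberately naive sketch shows what extracting facts from text might look like. It relies on two hand-written patterns that cover only the example sentences above. The pattern list, the relation labels and the `extract_facts` helper are assumptions made for illustration, and practical systems do not rely on such simple rules.

```python
import re

# A capitalized name, possibly made of several capitalized words.
NAME = r"[A-Z][\w-]*(?: [A-Z][\w-]*)*"

# Hand-written patterns: (regular expression, relation label). Illustrative only.
PATTERNS = [
    (re.compile(rf"({NAME}) founded ({NAME})"), "founded"),
    (re.compile(rf"({NAME}) is based in ({NAME})"), "based_in"),
]

def extract_facts(text):
    """Return (subject, relation, object) triples found by the patterns."""
    facts = []
    for pattern, relation in PATTERNS:
        for subject, obj in pattern.findall(text):
            facts.append((subject, relation, obj))
    return facts

text = "Bill Gates founded Microsoft. Microsoft is based in Redmond."
facts = extract_facts(text)
print(facts)
```

Patterns like these break as soon as the wording changes. “Microsoft was founded by Bill Gates” would already be missed.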
Do you speak binary?
Unfortunately, all these tasks are based on reverse engineering principles and require computers to have some level of understanding of human languages. In practice, despite the decades of intensive research, natural language processing still poses a major challenge for automatic algorithms.
Natural languages, as developed by people and for people, are context-dependent and full of ambiguities, and they lack rigorous mathematical structure or foundations. Understanding subtle information like sarcasm or subjective opinions is a long way out of a machine’s comfort zone. As a result, the effects of automatic processing of documents written in natural language are not always satisfactory.
Is there another possibility? Instead of focusing on the consumer part, a completely different approach might be a global change in the way the information is produced and shared.
Right now, the vast majority of the information produced uses formats intended for humans, usually free text documents. But since we already produce far too much information to consume directly without the assistance of machines, maybe the information should be exchanged in a machine-readable form, allowing computers to efficiently and more accurately help users fulfill their information needs?
A simple idea is switching from exchanging unstructured information to machine-readable formats starting from the very beginning of a document’s life, when the meaning of various information pieces can be provided directly by the document’s author. For example, all the following sentences: “Johannes Gutenberg developed the first printing press in 1450,” “The first printing press was made by Johannes Gutenberg in 1450,” “What do we know about the life of Johann Gutenberg, the inventor of the printing press?” could be represented by a single machine-readable tuple (data structure) without much loss. The information in this form can be directly used by a computer to answer the question “Who invented the printing press?”
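One possible shape for such a tuple is sketched below. The three field names, the relation label “invented” and the `who` helper are assumptions chosen for illustration rather than a format prescribed by the article.

```python
from typing import NamedTuple, Optional

class Fact(NamedTuple):
    """A (subject, relation, object) tuple; the field names are illustrative."""
    subject: str
    relation: str
    obj: str

# The fact shared by all three example sentences, stated once.
FACTS = [
    Fact("Johannes Gutenberg", "invented", "printing press"),
]

def who(relation: str, obj: str, facts=FACTS) -> Optional[str]:
    """Answer 'Who <relation> the <obj>?' by a direct lookup, with no parsing."""
    for fact in facts:
        if fact.relation == relation and fact.obj == obj:
            return fact.subject
    return None

print(who("invented", "printing press"))
```

Answering the question becomes a plain lookup. The machine never has to interpret a sentence written for humans.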
Some areas of activity are already structured in a machine-friendly manner. When you place an advert to sublet your apartment, you don’t generally type in a colorful description in a free text format. Instead, you usually fill out a form with data such as price, location, storey, furniture, the year the building was built, etc. Otherwise, a system (or a person) would have to dig this data out of the text, which is not always easy.
Ideally, if internet resources were impeccably structured, your Google searches would yield an executive summary rather than a collection of more or less useful links. For instance: if you typed in “homeopathy” you could get: first, a brief summary with a definition of the term; second, a table summarizing the results of all scientific studies ever carried out, complete with dates, sample sizes and an estimation of their credibility; then, the most relevant press releases both supporting and rejecting homeopathy, and finally, some social media output as the voice of the “regular people.”
The idea of disseminating information with directly specified structure and semantics is not new. The concept of the so-called Semantic Network Model, allowing the representation of semantically structured knowledge, was formed in the early 1960s. Tim Berners-Lee, the director of the World Wide Web Consortium, coined the term “Semantic Web,” which refers to enriching the network of web pages with machine-readable metadata about the pages and how they are related to each other. Berners-Lee defines the Semantic Web as “a web of data that can be processed directly and indirectly by machines.”
Switching from human-readable to machine-friendly representations for exchanging information is currently our best shot at curing the world’s information overload problem. We are already doing it in many cases. When searching for information, people usually type in keywords rather than full sentences, whereas employing hashtags is an attempt at grouping and structuring social media content.
Should we be afraid that machine-readable formats will entirely replace free text forms, effectively wiping out natural language? As long as people enjoy reading a good novel on a cold winter evening, natural languages should be safe. Even though “the times they are a-changin’,” a machine-readable version of Bob Dylan’s timeless classic would still sound ridiculous at a concert.