“Pay attention!” Words uttered by teachers in every classroom, urging students to shift their focus back to the lesson. Our world is filled with distraction, and our brain’s sensory and cognitive machinery is easily diverted. Attention is also a key consideration when developing artificial intelligence (AI). In a rapidly advancing world of robotics equipped with AI attempting to imitate human intelligence, the ability to attend to the relevant and ignore distractions becomes critical.
Professor John K. Tsotsos has spent over three decades advancing AI by studying and modelling human attentional abilities. He believes they constitute a prominent control mechanism for sensory cognition and behaviour. Specifically with respect to visual attention, how does the brain select a specific entity from a myriad of choices to focus on and then process it without distraction in order to achieve a desired behaviour?

What are you reading?
While you read these words… are you only concentrating on the black ink on white paper? Is your mind wandering to the peripheral images and colours, potentially interfering with your understanding of what you are seeing? How does your brain know which objects to pay attention to? If you consider this line of thought, the difficulty associated with modelling and simulating this complex selection system becomes apparent. An active partitioning and shuffling of information, feeding backward and forward in a sophisticated network, is required to enable human perception. When applied to AI, all of this must be accomplished with the finite processing speed and capacity of a computer processor.
Attentional control is what enables the functional generalisation of brain processes
Simulating an attentive human
Several models have been proposed to explain human attention. For example, in both information bottleneck models and capacity-limit models, concurrent activities may interfere with each other because they require the same operational mechanism or compete for limited processing resources. Many have therefore suggested that a communication network between the input and the brain should be able to select a specific activity. But this oversimplifies the problem; humans can deal with many things simultaneously (consider the act of driving a car). Researchers seek to understand how this is possible and then to embody that capacity into AI systems (such as autonomous cars).
Although attentive processes have been widely incorporated in AI, primarily as a way of limiting the amount of computation (and avoiding the combinatorial explosion) involved in solving complex problems, they apply more broadly. Certainly, the range of human attentive abilities that has been studied is wide. Professor Tsotsos thus proposed that attention is the set of selection, suppression, and restriction mechanisms that tune and control the search processes inherent in perception and cognition. This has led to the Selective Tuning model of visual attention, first simulated by S. Culhane in 1992, which includes almost two dozen different mechanisms. He claims that ultimately attentional control is what enables the functional generalisation of brain processes.
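The core idea of such selection — pick a winner at the top of a processing hierarchy, then trace back down, keeping the pathways that feed the winner and suppressing the rest — can be pictured in a toy form. The Python sketch below is an illustrative simplification only, not the published Selective Tuning algorithm; the two-layer max pyramid and the `pool_2x1` helper are invented for this example.

```python
import numpy as np

def pool_2x1(layer):
    # Each top-layer unit pools two adjacent bottom-layer units (toy pyramid).
    return np.maximum(layer[0::2], layer[1::2])

def selective_tuning_sketch(bottom):
    """Toy top-down winner-take-all pruning over a two-layer max pyramid."""
    top = pool_2x1(bottom)                    # bottom-up pass
    winner = int(np.argmax(top))              # select a winner at the top
    # Top-down pass: keep only the winner's receptive field, suppress the rest.
    attended = np.zeros_like(bottom)
    rf = slice(2 * winner, 2 * winner + 2)
    attended[rf] = bottom[rf]
    return winner, attended

acts = np.array([0.1, 0.3, 0.9, 0.2, 0.4, 0.1])
winner, attended = selective_tuning_sketch(acts)
```

In a real hierarchy this pruning would recurse through many layers, but even the two-layer version shows the essential combination of selection at the top and suppression below.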
Machine learning methods, specifically deep learning, have shown impressive results handling a variety of classically difficult problems in AI. Deep learning architectures use hierarchical, layered representations for computation (such as those shown in Fig 1a). As an input flows through such a pyramid representation, it becomes increasingly intertwined and amalgamated with nearby signals (shown in Fig 1b). This interference impairs the easy separation of simultaneously relevant processes. Not only could such architectures benefit from mechanisms to select the relevant from the irrelevant (such as algorithms N. Bruce developed with Tsotsos) but they also need to address the problem of how to reduce the interference that comes from this entanglement. Even though deep learning methods have shown remarkable successes, their capacity to generalise so that the same network can perform a variety of tasks is quite limited.
So many distractions
What is needed is a processing strategy that can enable this generalisation. This would require both a radically different architecture (incorporating aspects of Selective Tuning) and methods to morph a network from one task to another. This latter method Tsotsos has named Attentive Beamforming. Beamforming is a signal processing technique used in sensor arrays to separate desired from interfering signals. Electromagnetic waves are additive by nature. Consider two waves originating from two different points. Similar to the ripples from two stones tossed into a calm lake, the waves meet and interfere with one another, the ripples growing or shrinking. Beamforming dynamically adjusts this constructive and destructive wave interference to optimise a desired signal source. Tsotsos sees a parallel in the brain and suggests that attentive beamforming predicts the signal of interest and then reduces any interference due to the entanglement inherent in pyramid representations. Figure 2 illustrates how selective tuning performs this process.
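The signal-processing idea can be shown with a minimal delay-and-sum sketch: shift each sensor’s signal by a steering delay so that a wave from the chosen source adds constructively, while signals from elsewhere partially cancel. The toy Python example below is illustrative only — the sensor geometry and the 5-sample delay are invented for this example, and the brain analogue is conceptual, not literal.

```python
import numpy as np

def delay_and_sum(signals, delays_samples):
    """Shift each sensor's signal by its steering delay (in samples), then average."""
    out = np.zeros(signals.shape[1])
    for sig, d in zip(signals, delays_samples):
        out += np.roll(sig, d)   # roll is an exact phase shift for whole-period tones
    return out / len(signals)

# Two sensors hear the same 50 Hz tone; the second receives it 5 samples later.
fs = 1000
t = np.arange(200) / fs                      # exactly 10 periods, so roll wraps cleanly
source = np.sin(2 * np.pi * 50 * t)
signals = np.stack([source, np.roll(source, 5)])

steered = delay_and_sum(signals, [5, 0])     # delays matched: constructive sum
unsteered = delay_and_sum(signals, [0, 0])   # delays mismatched: partial cancellation
```

When the delays match the source, the averaged output keeps the tone’s full amplitude; when they do not, the copies partially cancel — the same constructive/destructive adjustment the article describes for ripples on a lake.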
Experimental observations
Many AI researchers claim that their systems match human performance, but few have led to new discoveries relating to human abilities. On the other hand, the Selective Tuning model has made many such predictions which have been experimentally confirmed by many human vision researchers. Focussing on his own experiments, the first supported prediction, in work with F. Cutzu, was that of an attentive suppressive surround. That is, and contrary to the conventional wisdom at the time, attending to one location does not make the perception of nearby things easier; it makes it harder, because the brain suppresses the local area to increase the visibility of what is being attended, as Figure 1c shows.
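The suppressive surround can be pictured as a ‘Mexican hat’ gain profile: a boost at the attended location ringed by a zone of suppression that fades with distance. The difference-of-Gaussians below is a common way to draw such a profile; the particular widths and strength here are arbitrary choices for illustration, not values fitted to any of the experiments described.

```python
import numpy as np

def attention_gain(positions, attended, sigma_c=1.0, sigma_s=3.0, k=0.6):
    """'Mexican hat' gain: boost at the attended location, suppression around it.
    Built from a difference of Gaussians; all constants are illustrative only."""
    d2 = (positions - attended) ** 2
    centre = np.exp(-d2 / (2 * sigma_c ** 2))          # narrow enhancement
    surround = k * np.exp(-d2 / (2 * sigma_s ** 2))    # broad suppression
    return 1.0 + centre - surround                     # multiplicative gain, baseline 1

x = np.arange(-10.0, 11.0)                             # positions in arbitrary units
gain = attention_gain(x, attended=0.0)
```

At the attended position the gain exceeds 1 (enhancement), a short distance away it dips below 1 (the suppressive surround), and far away it returns to baseline — the spatial structure the experiments below probe.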
With J.-M. Hopf, M. Bartsch, C.N. Boehler and others at the Otto von Guericke University Magdeburg, Germany, Professor Tsotsos tested other predictions using magnetoencephalographic recordings. Magnetoencephalography is a neuroimaging technique used for mapping brain activity. In one experiment, human observers were asked to focus their attention on a specified target, and interfering stimuli were then presented at various distances from the target. The findings indicate that neural enhancement and suppression coexist in a spatially structured manner. Moreover, the suppression that occurs for stimuli near the attended one is due to a top-down processing wave following the initial bottom-up wave. Magnetoencephalography was also used to examine attentive colour selectivity in the human brain. Here, human subjects’ responses to a colour probe (interference) varying in colour from a given colour target were analysed. The observations confirm the Selective Tuning prediction that attending to a particular colour suppresses the influences of nearby colours, and that this is due to recurrent processes. This functionality holds not only for colour; Tsotsos has also experimentally confirmed it for orientations (with M. Tombu), action selection (with D. Loach, A. Frischen and N. Bruce) and motion stimuli (with M. Fallah, J. Martinez-Trujillo, S. Treue and S.-A. Yoo). Most recently, with C. Wloka and I. Kotseruba, the model has been shown to generate eye movement sequences that match human eye movement behaviour to within 0.2% error, matching the inter-subject error of human observers, with the closest competitor at over 13 times higher error.
Attention is the set of selection, suppression, and restriction mechanisms that tune and control the search processes inherent in perception and cognition
Attention executive
Given the complexity and number of attentive processes, a natural question to ask is how these may be coordinated in order to achieve the subject’s goals. Different visual tasks each require different actions, different timings of actions, different durations, etc., and when complete, the process moves on to the next task. The actions require coordination and synchronisation and these are the responsibilities of an Attention Executive. Tsotsos, with colleague T. Womelsdorf, has noted a distinct cyclic activation of control signals that trigger the required timing synchronisations, with differing cycles corresponding to different visual tasks. In other words, these actions tell the visual system how to be more sensitive and selective for the relevant, how to suppress the irrelevant, how to order its processing timeline, how to make decisions, how to move the eyes to observe new scenes, and more, effectively turning a general-purpose system into one specialised for the current task. The system is morphed into a new special purpose system. It can then be morphed back to general purpose or into another kind of special purpose system. This is how generalisation is achieved.
Did you pay attention?
Now you begin to understand that attention is what enables the functional generalisation of brain processes. The overall system is highly dynamic and adaptive, changing to accommodate the current visual input and reason for viewing that input. Everything you see is processed according to why you are looking at it, which detail within the scene you need to pay attention to and what other details in the scene you need to find. Without an attentional executive to set this process in motion, coordinate, synchronise, parameterise, select the relevant, suppress the irrelevant, and monitor the various mechanisms at play, intelligent visual behaviour is not likely possible.
The model was derived from first principles. That is, it starts directly from the established definitions, facts and theorems of computing theory, without assumptions about an empirical model or fitted parameters. There was no model fitting or parameter adjustment based on data for the overall structure and strategy. Understanding the solution search space and achieving its sufficient reduction were the keys. Typical data-driven learning has shown success at setting parameter values, but not at determining model structure (e.g., the number of network levels) or basic algorithms (for learning or decision-making). These have proved difficult to learn simply from statistical regularities in observations. Data played a role only after the structure and algorithmic strategies were developed.
Would a model based on quantum mechanics, for example superposition of states, provide any new insights into tackling this complex mechanism?
It is common to wonder whether a different methodology might provide new insights. I do not think this would be so for a quantum physics perspective on AI (see Tsotsos 1990). Modelling is quite dependent on having an understanding of the target – what exactly is it that we wish to model? For intelligence, we can’t even agree on a definition let alone a comprehensive description of what constitutes intelligent behaviour. So the problem is not the means of modelling, it is how to know if we have succeeded. My own work is purely computational and follows a classic scientific method in that as long as the model agrees with experimental observations and makes predictions of new behaviours that can be experimentally confirmed, then the model remains viable. The challenge is for other modelling philosophies to do the same, and to this point they have not.
Tsotsos, J., Exactly Which Emperor is Penrose Talking About?, The Behavioral and Brain Sciences, 13(4):686-687, (invited Commentary on R. Penrose, The Emperor’s New Mind, Oxford, 1989), 1990.
Age deteriorates human vision; would that apply to a fully functional AI?
The most obvious ‘ageing’ that an AI system would experience is the rapid turnover of software systems; but this is hardly the same thing. In general, there is little relationship between human ageing and computer ageing.
Preoccupation results in ‘staring into space’ sometimes. Would AI shut off visual input in such a scenario?
Since an AI is designed, it seems difficult to see what design criterion would lead to a “preoccupation mode”. Even if a human is preoccupied, there are triggers to snap one out of that state – loud noise, bright lights, etc. So no senses are turned off; rather, they are momentarily inhibited, but not so much that one cannot escape from that state: that would not have much evolutionary advantage. An AI should always be vigilant to its environment, so I do not see how a preoccupation mode could be useful.
Is the performance of the attentional executive what determines a human’s IQ?
Questions about human intelligence are beyond the scope of my research. Nevertheless, there is an aspect of this that falls out of my approach. The research began with an investigation of the computational complexity of vision – how difficult is the problem of vision? It revealed that, in principle, if one could simply lay out all possible perceptions, then for any given scene one need only search through that list to find the right one. This applies to a wide variety of problems not just vision. If you tested the abilities of any agent, human or artificial, the kinds of quantitative measures that are used include how accurately and how quickly a problem is solved. People who are faster and more accurate score better on IQ tests. In my model, it is the attention executive that controls the parameters of any search process within visual processing, so its quality determines the AI’s IQ. It is a natural hypothesis then that, since so much of the foundation of the model has demonstrated a strong relationship to human visual processing, perhaps this assertion is also true.
The research of Professor Tsotsos explores three main themes: visual attention in humans and computer systems; computer vision; and visually-guided mobile robotics. The computational approach to the study of visual attention, with a focus on computational complexity, developed by Professor Tsotsos over the past 25 years, is the subject of his book https://mitpress.mit.edu/books/computational-perspective-visual-attention
Funding
- The Natural Sciences and Engineering Research Council of Canada
- The Canada Research Chairs Program.
Collaborators
Dr Tsotsos has been lucky to collaborate with many colleagues, post-docs and students throughout his career. Only a few are mentioned within this article.
Bio
Prof Tsotsos obtained his PhD in 1980 from the University of Toronto and joined its faculty, founding its Computer Vision Research Group. In 2000, he moved to York University to become Director of its Centre for Vision Research. His honours include Fellowship of the Canadian Institute for Advanced Research, Fellowship of the Royal Society of Canada, IEEE Fellowship, and the 2015 Sir John William Dawson Medal for excellence in multidisciplinary research.
Contact
Professor John K. Tsotsos
Dept. of Electrical Engineering and Computer Science,
Lassonde Bldg. 120 Campus Walk,
York University
4700 Keele St.
Toronto
Ontario, M3J 1P3
Canada
E: tsotsos@cse.yorku.ca
W: http://lassonde.yorku.ca/users/johntsotsos