“Pay attention!” Words uttered by teachers in every classroom, urging students to shift their focus back to the lesson. Our world is filled with distraction, and our brain’s sensory and cognitive machinery is easily diverted. Attention is also a key consideration when developing artificial intelligence (AI). In a rapidly advancing world of robotics equipped with AI attempting to imitate human intelligence, the ability to attend to the relevant and ignore distractions becomes critical.
Professor John K. Tsotsos has spent over three decades advancing AI by studying and modelling human attentional abilities. He believes they constitute a prominent control mechanism for sensory cognition and behaviour. Specifically with respect to visual attention, how does the brain select a specific entity from a myriad of choices to focus on and then process it without distraction in order to achieve a desired behaviour?

What are you reading?
While you read these words… are you only concentrating on the black ink on white paper? Is your mind wandering to the peripheral images and colours, potentially interfering with your understanding of what you are seeing? How does your brain know which objects to pay attention to? If you consider this line of thought, the difficulty associated with modelling and simulating this complex selection system becomes apparent. An active partitioning and shuffling of information, feeding backward and forward in a sophisticated network, is required to enable human perception. When applied to AI, all of this must be accomplished with the finite processing speed and capacity of a computer processor.
Attentional control is what enables the functional generalisation of brain processes
Simulating an attentive human
Several models have been proposed to explain human attention. For example, in both information bottleneck models and capacity-limit models, concurrent activities may interfere with each other because they require the same operational mechanism or compete for limited processing resources. Many have therefore suggested that a communication network between the input and the brain should be able to select a specific activity. But this oversimplifies the problem; humans can deal with many things simultaneously (consider the act of driving a car). Researchers seek to understand how this is possible and then to embody that capacity into AI systems (such as autonomous cars).
Although attentive processes have been widely incorporated in AI, primarily as a way of limiting the amount of computation (and avoiding the combinatorial explosion) involved in solving complex problems, they apply more broadly. Certainly, the range of human attentive abilities that has been studied is wide. Professor Tsotsos thus proposed that attention is the set of selection, suppression, and restriction mechanisms that tune and control the search processes inherent in perception and cognition. This has led to the Selective Tuning model of visual attention, first simulated by S. Culhane in 1992, which includes almost two dozen different mechanisms. He claims that ultimately attentional control is what enables the functional generalisation of brain processes.
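The core idea of such selection — pick a winner at the top of a processing hierarchy, then trace back down, keeping the pathways that feed the winner and suppressing the rest — can be pictured in a toy form. The Python sketch below is an illustrative simplification only, not the published Selective Tuning algorithm; the two-layer max pyramid and the `pool_2x1` helper are invented for this example.

```python
import numpy as np

def pool_2x1(layer):
    # Each top-layer unit pools two adjacent bottom-layer units (toy pyramid).
    return np.maximum(layer[0::2], layer[1::2])

def selective_tuning_sketch(bottom):
    """Toy top-down winner-take-all pruning over a two-layer max pyramid."""
    top = pool_2x1(bottom)                    # bottom-up pass
    winner = int(np.argmax(top))              # select a winner at the top
    # Top-down pass: keep only the winner's receptive field, suppress the rest.
    attended = np.zeros_like(bottom)
    rf = slice(2 * winner, 2 * winner + 2)
    attended[rf] = bottom[rf]
    return winner, attended

acts = np.array([0.1, 0.3, 0.9, 0.2, 0.4, 0.1])
winner, attended = selective_tuning_sketch(acts)
```

In a real hierarchy this pruning would recurse through many layers, but even the two-layer version shows the essential combination of selection at the top and suppression below.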
Machine learning methods, specifically deep learning, have shown impressive results handling a variety of classically difficult problems in AI. Deep learning architectures use hierarchical, layered representations for computation (such as those shown in Fig 1a). As an input flows through such a pyramid representation, it becomes increasingly intertwined and amalgamated with nearby signals (shown in Fig 1b). This interference impairs the easy separation of simultaneously relevant processes. Not only could such architectures benefit from mechanisms to select the relevant from the irrelevant (such as algorithms N. Bruce developed with Tsotsos) but they also need to address the problem of how to reduce the interference that comes from this entanglement. Even though deep learning methods have shown remarkable successes, their capacity to generalise so that the same network can perform a variety of tasks is quite limited.
So many distractions
What is needed is a processing strategy that can enable this generalisation. This would require both a radically different architecture (incorporating aspects of Selective Tuning) and methods to morph a network from one task to another. This latter method Tsotsos has named Attentive Beamforming. Beamforming is a signal processing technique used in sensor arrays to separate desired from interfering signals. Electromagnetic waves are additive by nature. Consider two waves originating from two different points. Similar to the ripples from two stones tossed into a calm lake, the waves meet and interfere with one another, the ripples growing or shrinking. Beamforming dynamically adjusts this constructive and destructive wave interference to optimise a desired signal source. Tsotsos sees a parallel in the brain and suggests that attentive beamforming predicts the signal of interest and then reduces any interference due to the entanglement inherent in pyramid representations. Figure 2 illustrates how selective tuning performs this process.
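The signal-processing idea can be shown with a minimal delay-and-sum sketch: shift each sensor’s signal by a steering delay so that a wave from the chosen source adds constructively, while signals from elsewhere partially cancel. The toy Python example below is illustrative only — the sensor geometry and the 5-sample delay are invented for this example, and the brain analogue is conceptual, not literal.

```python
import numpy as np

def delay_and_sum(signals, delays_samples):
    """Shift each sensor's signal by its steering delay (in samples), then average."""
    out = np.zeros(signals.shape[1])
    for sig, d in zip(signals, delays_samples):
        out += np.roll(sig, d)   # roll is an exact phase shift for whole-period tones
    return out / len(signals)

# Two sensors hear the same 50 Hz tone; the second receives it 5 samples later.
fs = 1000
t = np.arange(200) / fs                      # exactly 10 periods, so roll wraps cleanly
source = np.sin(2 * np.pi * 50 * t)
signals = np.stack([source, np.roll(source, 5)])

steered = delay_and_sum(signals, [5, 0])     # delays matched: constructive sum
unsteered = delay_and_sum(signals, [0, 0])   # delays mismatched: partial cancellation
```

When the delays match the source, the averaged output keeps the tone’s full amplitude; when they do not, the copies partially cancel — the same constructive/destructive adjustment the article describes for ripples on a lake.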
Experimental observations
Many AI researchers claim that their systems match human performance, but few have led to new discoveries relating to human abilities. On the other hand, the Selective Tuning model has made many such predictions which have been experimentally confirmed by many human vision researchers. Focussing on his own experiments, the first supported prediction, in work with F. Cutzu, was that of an attentive suppressive surround. That is, and contrary to the conventional wisdom at the time, attending to one location does not make the perception of nearby things easier; it makes it harder, because the brain suppresses the local area to increase the visibility of what is being attended, as Figure 1c shows.
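The suppressive surround can be pictured as a ‘Mexican hat’ gain profile: a boost at the attended location ringed by a zone of suppression that fades with distance. The difference-of-Gaussians below is a common way to draw such a profile; the particular widths and strength here are arbitrary choices for illustration, not values fitted to any of the experiments described.

```python
import numpy as np

def attention_gain(positions, attended, sigma_c=1.0, sigma_s=3.0, k=0.6):
    """'Mexican hat' gain: boost at the attended location, suppression around it.
    Built from a difference of Gaussians; all constants are illustrative only."""
    d2 = (positions - attended) ** 2
    centre = np.exp(-d2 / (2 * sigma_c ** 2))          # narrow enhancement
    surround = k * np.exp(-d2 / (2 * sigma_s ** 2))    # broad suppression
    return 1.0 + centre - surround                     # multiplicative gain, baseline 1

x = np.arange(-10.0, 11.0)                             # positions in arbitrary units
gain = attention_gain(x, attended=0.0)
```

At the attended position the gain exceeds 1 (enhancement), a short distance away it dips below 1 (the suppressive surround), and far away it returns to baseline — the spatial structure the experiments below probe.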
With J.-M. Hopf, M. Bartsch, C.N. Boehler and others at the Otto von Guericke University Magdeburg, Germany, Professor Tsotsos tested other predictions using magnetoencephalographic recordings. Magnetoencephalography is a neuroimaging technique used for mapping brain activity. In one experiment, human observers were asked to focus their attention on a specified target, and interfering stimuli were then presented at various distances from the target. The findings indicate that neural enhancement and suppression coexist in a spatially structured manner. Moreover, the suppression that occurs for stimuli near the attended one is due to a top-down processing wave following the initial bottom-up wave. Magnetoencephalography was also used to examine attentive colour selectivity in the human brain. Here, human subjects’ responses to a colour probe (interference) varying in colour from a given colour target were analysed. The observations confirm the Selective Tuning prediction that attending to a particular colour suppresses the influences of nearby colours, and that this is due to recurrent processes. This functionality holds not only for colour; Tsotsos has also experimentally confirmed it for orientations (with M. Tombu), action selection (with D. Loach, A. Frischen and N. Bruce) and motion stimuli (with M. Fallah, J. Martinez-Trujillo, S. Treue and S.-A. Yoo). Most recently, with C. Wloka and I. Kotseruba, the model has been shown to generate eye movement sequences that match human eye movement behaviour to within 0.2% error, matching the inter-subject error of human observers, with the closest competitor at over 13 times higher error.
Attention is the set of selection, suppression, and restriction mechanisms that tune and control the search processes inherent in perception and cognition
Attention executive
Given the complexity and number of attentive processes, a natural question to ask is how these may be coordinated in order to achieve the subject’s goals. Different visual tasks each require different actions, different timings of actions, different durations, etc., and when complete, the process moves on to the next task. The actions require coordination and synchronisation and these are the responsibilities of an Attention Executive. Tsotsos, with colleague T. Womelsdorf, has noted a distinct cyclic activation of control signals that trigger the required timing synchronisations, with differing cycles corresponding to different visual tasks. In other words, these actions tell the visual system how to be more sensitive and selective for the relevant, how to suppress the irrelevant, how to order its processing timeline, how to make decisions, how to move the eyes to observe new scenes, and more, effectively turning a general-purpose system into one specialised for the current task. The system is morphed into a new special purpose system. It can then be morphed back to general purpose or into another kind of special purpose system. This is how generalisation is achieved.
Did you pay attention?
Now you begin to understand that attention is what enables the functional generalisation of brain processes. The overall system is highly dynamic and adaptive, changing to accommodate the current visual input and reason for viewing that input. Everything you see is processed according to why you are looking at it, which detail within the scene you need to pay attention to and what other details in the scene you need to find. Without an attentional executive to set this process in motion, coordinate, synchronise, parameterise, select the relevant, suppress the irrelevant, and monitor the various mechanisms at play, intelligent visual behaviour is not likely possible.
The model was derived from first principles. That is, it starts directly from the established definitions, facts and theorems of computing theory, without assumptions about an empirical model or fitted parameters. There was no model fitting or parameter adjustment based on data for the overall structure and strategy. Understanding the solution search space and achieving its sufficient reduction were the keys. Typical data-driven learning has shown success at setting parameter values, but not at determining model structure (e.g., the number of network levels) or basic algorithms (for learning or decision-making). These have proved difficult to learn simply from statistical regularities in observations. Data played a role only after the structure and algorithmic strategies were developed.
Would a model based on quantum mechanics, for example superposition of states, provide any new insights into tackling this complex mechanism?
It is common to wonder whether a different methodology might provide new insights. I do not think this would be so for a quantum physics perspective on AI (see Tsotsos 1990). Modelling is quite dependent on having an understanding of the target – what exactly is it that we wish to model? For intelligence, we can’t even agree on a definition let alone a comprehensive description of what constitutes intelligent behaviour. So the problem is not the means of modelling, it is how to know if we have succeeded. My own work is purely computational and follows a classic scientific method in that as long as the model agrees with experimental observations and makes predictions of new behaviours that can be experimentally confirmed, then the model remains viable. The challenge is for other modelling philosophies to do the same, and to this point they have not.
Tsotsos, J., Exactly Which Emperor is Penrose Talking About?, The Behavioral and Brain Sciences, 13(4):686-687, (invited Commentary on R. Penrose, The Emperor’s New Mind, Oxford, 1989), 1990.
Age deteriorates human vision; would that apply to a fully functional AI?
The most obvious ‘ageing’ that an AI system would experience is the rapid turnover of software systems; but this is hardly the same thing. In general, there is little relationship between human ageing and computer ageing.
Preoccupation results in ‘staring into space’ sometimes. Would AI shut off visual input in such a scenario?
Since an AI is designed, it seems difficult to see what design criterion would lead to a “preoccupation mode”. Even if a human is preoccupied, there are triggers to snap one out of that state – loud noise, bright lights, etc. So no senses are turned off; rather, they are momentarily inhibited, but not so much that one cannot escape from that state: that would not have much evolutionary advantage. An AI should always be vigilant to its environment, so I do not see how a preoccupation mode could be useful.
Is the performance of the attentional executive what determines a human’s IQ?
Questions about human intelligence are beyond the scope of my research. Nevertheless, there is an aspect of this that falls out of my approach. The research began with an investigation of the computational complexity of vision – how difficult is the problem of vision? It revealed that, in principle, if one could simply lay out all possible perceptions, then for any given scene one need only search through that list to find the right one. This applies to a wide variety of problems not just vision. If you tested the abilities of any agent, human or artificial, the kinds of quantitative measures that are used include how accurately and how quickly a problem is solved. People who are faster and more accurate score better on IQ tests. In my model, it is the attention executive that controls the parameters of any search process within visual processing, so its quality determines the AI’s IQ. It is a natural hypothesis then that, since so much of the foundation of the model has demonstrated a strong relationship to human visual processing, perhaps this assertion is also true.
The research of Professor Tsotsos explores three main themes: visual attention in humans and computer systems; computer vision; and visually-guided mobile robotics. The computational approach to the study of visual attention, with a focus on computational complexity, developed by Professor Tsotsos over the past 25 years, is the subject of his book https://mitpress.mit.edu/books/computational-perspective-visual-attention
Funding
- The Natural Sciences and Engineering Research Council of Canada
- The Canada Research Chairs Program.
Collaborators
Dr Tsotsos has been lucky to collaborate with many colleagues, post-docs and students throughout his career. Only a few are mentioned within this article.
Bio
Prof Tsotsos obtained his PhD in 1980 from the University of Toronto and joined its faculty, founding its Computer Vision Research Group. In 2000, he moved to York University to become Director of its Centre for Vision Research. His honours include Fellowship of the Canadian Institute for Advanced Research, Fellowship of the Royal Society of Canada, IEEE Fellowship, and the 2015 Sir John William Dawson Medal for excellence in multidisciplinary research.
Contact
Professor John K. Tsotsos
Dept. of Electrical Engineering and Computer Science,
Lassonde Bldg. 120 Campus Walk,
York University
4700 Keele St.
Toronto
Ontario, M3J 1P3
Canada
E: tsotsos@cse.yorku.ca
W: http://lassonde.yorku.ca/users/johntsotsos