Balancing the Metrics of Research Evaluation
Research evaluation is one small part of the human condition, and one in which the expression of performance is increasingly communicated through the abstraction of numbers. But amid all these data, we mustn’t lose sight of the essential humanity of the endeavor, an essence that perhaps can’t be abstracted. Although a path is emerging, there are challenges to be faced in the future, and balances to be found between abstracted data and human need.
It is important to us as thinking beings to explain cause and effect, to measure it, to influence it. In the absence of known theory, we do our best with the tools available to us. We work towards a consensus, ideally using scientific methods, and hopefully we arrive at a broadly accepted theoretical framework.
As an illustration of where we are, and how we got here, I want to begin by considering two areas, both of which I feel have some relationship with how research metrics are being developed and used. The first comes from ideas of human management, or human resource management. The second comes from engineering: the science of feedback.
Life used to be so simple when it came to managing human resources. To motivate people, you’d show them an example of what to do, and then punish them if they didn’t raise their game. Take the example of General Drusus, whose life is immortalized in the Drusus Stone in Germany. He was only 30 at his death, and the monument was constructed in his honor for having brought peace to that part of the Roman Empire. Yet the raising of the masonry block, roughly two stories in height, was driven by the motivation of his legionnaires. Drusus was popular, but he was also stepson to the Roman Emperor Augustus and brother to the subsequent Emperor, Tiberius.
In modern terms, we might present chemists with the example of Nobel Laureate Greg Winter, whose discoveries enabled modern cancer treatments using monoclonal antibodies, and who founded an industry worth hundreds of millions of dollars. Having shown them such an example, powerful entities might threaten to defund their labs if they didn’t produce in like fashion. This may sound far-fetched, but there are present-day examples of it in scholarly research.
It took until the beginning of the twentieth century for industry to start examining personnel management more seriously. And it didn’t emerge from any moral or ethical drive; rather, it was born pragmatically of the economic and population crises that followed the world wars. It was driven by the need to rebuild countries, and to accommodate emerging labor organizations, democracy, and social ambition. The first formal attempts at understanding the human element in work, as compared to the “unit of production” approach of Taylorism, were made in the 1950s, based on a scientific hypothesis inspired by new ideas of inheritance and inherent, unchangeable qualities, which were the behavioral science and psychology of the time. The 1960s and 70s saw the introduction of goal-oriented appraisal, and, for the first time, the subject, the employee, became able to reflect on his or her own performance. Over the last two decades, we have seen the growth of 360 Appraisal.
From a personal perspective, the start of my career was marked by line managers telling me how well I’d done. For a few years after that, I was asked how well I was doing. And now, for the last few years, I’m asked how well the company accommodates me, and what it could do better. (Not just at Digital Science, although it’s a great place to work!)
In 2000 years, then, we have come a long way. We’ve moved from a combination of “be like this” and “do as I say”, through “you are what you are, and that’s a fixed quantity”, to a much more sophisticated concept: how do you fit into a system, and how do we optimize you, “the human”, inside this complex network of interdependencies? In short, we have abandoned a STEM-like “scientific” approach in favor of a more discourse-focused, human-centered, experiential process.
An opposite trend may be observed in the fields of engineering, computer science, and mathematics. The notion of a system that receives feedback and responds accordingly, a concept at the heart of any performance management system, was formed in ancient Greece, was developed further by Arabic cultures, and finally flourished in the industrial revolution. Time was always the driver; for 1,500 years, humanity was obsessed with accurate time-keeping. In the 1700s, feedback mechanisms became essential to governing the speed of steam engines and mills. Engineers began to use the mathematics of feedback science to predict and develop mechanisms as part of the system, rather than deploying them in an ad hoc manner to control unruly machines. We see in this field the genesis of hypothesis-driven research, rather than trial-and-error experimentation. In the 1950s, Soviet scientists made huge theoretical breakthroughs to support their space program; mathematics and computer science have since combined to give us all miniaturized devices with more positional accuracy than was conceived of only a few years ago.
Thus, we have two very different approaches to feedback, correction, and evaluation: an approach to managing humans that has become increasingly humane over the decades (as more dogmatic, scientific approaches have failed to produce rewards); and an approach best suited to systems (even systems that involve humans), requiring a rigorous, theory-based approach to control. How do these apply to the “business” or “industry” of research?
I think that we must be willing to view one of the contexts of research evaluation as part of the feedback loop of “research as a business”, with its expectation of a return on investment in that research. John Harrison, who invented the first clock sufficiently accurate to compute longitude at sea, was supported financially by the British Government, which stood to gain massively from the increased navigational efficiency of its fleet. In that instance, it’s worth observing that the Government refused to accept that he had performed well enough to merit winning the more than three million dollars they’d established as the award, and that Harrison had to resort to any number of tactics to maintain a financial lifeline.
Researcher and funder fall out over results. The sun never sets on that one.
Today, research is a well-funded industry. Digital Science’s Dimensions application has indexed 1.4 trillion dollars of research funding, and a variety of outputs from that funding: 100 million publications, nearly 40 million patents, half a million clinical trials, and a similar number of policy documents. One might be crude, take one number, divide it by another, and come to some conclusions about productivity, but such “analysis” would likely be unhelpful in any context.
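(To make the point concrete: dividing the 1.4 trillion dollars of tracked funding by the 100 million publications gives roughly 14,000 dollars of indexed funding per publication, a figure that, on its own, tells us almost nothing about how well that money was spent or what it produced.)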
According to researchers Kate Williams and Jonathan Grant, one of the most decisive steps towards a metrics-centered view of research evaluation almost happened in Australia, in 2005. The proposal was explicitly based on a political commitment to strengthen links between industry and universities, and the proposed Research Quality Framework focused on the broader impact of research, as well as its quality. The plan was eventually abandoned, largely due to political change. Nevertheless, it was hugely influential on the UK’s proposal to replace its Research Assessment Exercise with a system based on quantitative metrics. One obstacle that came up (according to Williams and Grant) was the explicit “steering of researchers and universities”. The UK did eventually adopt its new framework, the Research Excellence Framework, or REF, although the impact component carried a much reduced weighting, initially set at 20% and rising to 25% in 2017.
The movement towards greater reliance on evaluative metrics within the research cycle has inspired its own responses. Whether through DORA, the Leiden Manifesto, or the Responsible Metrics movement, we see positions forming on what constitutes appropriate and inappropriate, responsible and irresponsible, use.
For me, this presents an interesting dichotomy. Take a hypothetical example of a metric built by relating many citations to many papers. The simplest way to do this is to divide the former by the latter, which is probably the most common approach. It’s certainly well understood by the clear majority of people. And yet it’s highly misleading. That simple arithmetic works well if you have an approximate balance between highly and lowly cited documents, a situation that simply never occurs in citation data, where you always have many low-performing documents and a small number of high-performing ones. Using such simple but misleading mathematics may lead to the conclusion that the clear majority of documents are “below average”. Which is supremely unhelpful.
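As a minimal sketch of that problem, using entirely hypothetical citation counts, the short Python example below shows how a single highly cited paper pulls the arithmetic mean above almost every document in the set:

```python
# Hypothetical, heavily skewed citation counts: many lowly cited documents
# and one very highly cited one, which is the shape citation data always takes.
citations = [0, 0, 0, 1, 1, 1, 2, 2, 3, 120]

# "Divide one number by another": total citations over total papers.
arithmetic_mean = sum(citations) / len(citations)

# Count how many documents fall below that "average".
below_average = sum(1 for c in citations if c < arithmetic_mean)

print(f"Arithmetic mean: {arithmetic_mean:.1f}")                              # 13.0
print(f"Documents below that average: {below_average} of {len(citations)}")  # 9 of 10
```

Nine of the ten documents end up “below average”, which is exactly the unhelpful conclusion described above.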
The Leiden Manifesto elegantly observes that: “Simplicity is a virtue in an indicator because it enhances transparency. But simplistic metrics can distort the record (see principle 7). Evaluators must strive for balance — simple indicators true to the complexity of the research process.”
My experience is that although nearly everyone is happy with “divide one number by another”, as soon as we introduce some better mathematical practice (for example, calculating the exponential of the arithmetic mean of the natural logs of the citations, to reduce the effect of the small number of highly cited articles), the eyes of the audience glaze over. This is so even if the result is an average value that is much “fairer” and “more responsible” than the arithmetic mean.
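For readers who want to see that calculation written out, here is a sketch on the same hypothetical counts as before. Note that the +1 offset before taking logs (and the matching -1 afterwards) is one common convention for coping with zero-citation papers; it is my assumption here, not something specified above.

```python
import math

# The same hypothetical, skewed citation counts as in the earlier sketch.
citations = [0, 0, 0, 1, 1, 1, 2, 2, 3, 120]

# Exponentiate the arithmetic mean of the natural logs of the citations.
# The +1 / -1 shift is an assumed convention for handling zero-citation papers.
mean_of_logs = sum(math.log(c + 1) for c in citations) / len(citations)
damped_average = math.exp(mean_of_logs) - 1

arithmetic_mean = sum(citations) / len(citations)

print(f"Arithmetic mean:    {arithmetic_mean:.1f}")   # 13.0
print(f"Log-damped average: {damped_average:.1f}")    # roughly 1.8
```

The log-based figure sits far closer to the bulk of the documents, but explaining why takes considerably more effort than “divide one number by another”.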
Finding this balance – between accessibility and fairness – is even more critical when it comes to considering the changing population of people who are using metrics. Every week, on various email lists, we see people posting messages akin to “Hi, I’m reasonably new to metrics, but my head of library services just made me responsible for preparing a report … and how do I start?”
Initiatives such as the LIS-Bibliometrics events (under the watchful eye of Dr. Lizzie Gadd) and the Metrics Toolkit are essential components in supporting the education and engagement of this new body of research assessment professionals.
Let’s focus on a bigger question: what are we trying to achieve with research metrics and evaluation? Are there two different things going on here?
We are engaged in a human endeavor: researching Alzheimer’s disease, for example. What strategies are useful? What are we trying to cure, prevent, slow down, or ameliorate? For widespread populations? Within families, or for an individual? What funding works? What drugs? Which areas should see investment withdrawn, perhaps just for the present? Are there any effective governmental policies that can help shift the curve? In the field of metrics and evaluation, a key part of the work is trying to understand the extremely complex relationships and interdependencies within topics.
We have other questions. How well is a lab or a funder or a researcher or a method performing? What can we do to optimize their public engagement, or international collaborations? Some of these components work more effectively than others, under different circumstances. They respond differently: in short, although they are complex, they are principally human artefacts, and, because they are, they are capable of reflection and change.
There are two other areas where feedback and evaluation have been crucial factors in the development of human performance and system efficiency. The first is the increasingly human-centric analysis practiced by corporations in the pursuit of excellence; the second is the great theoretical, mathematical, and computational breakthroughs that have revolutionized all forms of technology.
At the current time, it feels as though we are standing at a fork. On the one hand, we could use big data, network theory, advanced visualizations, AI, and so on to dig deeply into research topics, to spotlight new ideas and insights regarding performance in this area of human society. On the other hand, we have the increasing impact of research metrics on individual humans, and the need for those metrics to be acceptable to the broadest possible slice of our community.
These two things are incompatible. When I think about the analytics that are possible with a modern approach to research metrics, I often think of the work of Professor Chaomei Chen at Drexel. Chaomei has been working for several years on deep analysis of the full text of research articles. His goal is to map the progress of a topic as it moves from uncertainty (“it is suggested that virus A is implicated in condition B”) to certainty (“B is caused by A”). The technology draws heavily on a number of theoretical approaches, and Chaomei can present the results using highly informative visualizations.
Although these visualizations can support qualitative statements about the role of individuals, laboratories, or journals, that is not their purpose. They are designed to illuminate the trends, status, and progression of topic-based work.
When it comes to looking at individual components, I think there is another revolution to come. For years, we have been accustomed to thinking of metrics as a thing that happens to researchers, or (if you work in a research office) a thing that you do to yourself. The world is changing, and the new generation of researchers will be much more aware of their own standing, their own profiles, their own strengths, and their own ambitions. This is, after all, the selfie generation, and if the massive current trend towards sharing, collaboration, and open access that was inspired by the Napster generation continues (a high-school graduate when Napster was launched is now in her late 30s), then over the next twenty years we are going to see a far more self-aware and self-reflective population of researchers than we’ve been accustomed to.
The recent push towards “profiles” and the use of “baskets” (or “buckets”) of metrics is compatible with this generation, and is a start. We should be prepared for more of the same, and that includes investing in some of the concepts that we see in Human Resources (or “Talent Management”, as we now see it called): for example, 360 reviews. Why shouldn’t a researcher be asking hard questions of a funder’s support? Or of a journal’s likelihood of promoting the research in the media? Or of the prospects for promotion in a lab?
In conclusion, I am extremely optimistic about the state of metrics. The conversations and movements seem to be heading in the right direction, but both sides would benefit from more discussion of the purpose, and the limitations, of the data-driven approach.