The Problem with Data Validation

John Kratz of the California Digital Library recently published an article entitled ‘Fifteen ideas about data validation (and peer review)’.

He describes it as a “longish list of non-parallel, sometimes-overlapping ideas about how data review, validation, or quality assessment could or should work,” and lays out fifteen observations and recommendations to improve the process.

Problems with data validation can sometimes arise, as academic researchers often only publish raw datasets alongside their articles.

As a result, it can be difficult to assess the reliability and relevance of this data.

Whilst, as the author notes, there are some mechanisms in place to validate data, they are severely lacking in comparison to those that exist for citations, for example, where several widely recognised styles are already established.

This is somewhat surprising: data validation is clearly important in assuring the credibility of an academic article, so strong mechanisms, and even a standard procedure, should be in place to support it.

One theme running throughout Kratz’s ideas is the depth to which data needs to be reviewed: not by one person alone, but divided up among people or even organisations. Both data and metadata should be reviewed, not only by other academics, but also by experts in the field, the community and the users of the data. Similarly, beyond mere validation, actual use of the data is a form of review in itself, and helps to confirm whether the data really is relevant, applicable and fit for purpose.



Evaluating the h-index

Stacy Konkiel of ImpactStory has published an article entitled “Four great reasons to stop caring so much about the h-index”, questioning the reliability of the h-index in assessing a scholar’s prominence.

ImpactStory is an “open-source, web-based tool that helps researchers explore and share the diverse impacts of all their research products—from traditional ones like journal articles, to emerging products like blog posts, datasets, and software.” Their aim is to create a new recognition system for scholars based on data and web impact.

The h-index was developed by Jorge E. Hirsch, and is an index that attempts to measure the productivity and impact of scholars via their published works. The index is based on a scholar’s most cited papers and publications, and the number of citations that they have received in other academic publications from other scholars.
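To make the definition concrete, here is a minimal sketch of the calculation, using Hirsch's usual formulation: a scholar has index h if h of their papers have each been cited at least h times. The function name and the citation counts are invented for illustration and do not come from Konkiel's article.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    # Rank papers from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank   # the top `rank` papers all have at least `rank` citations
        else:
            break      # later papers are cited even less, so stop here
    return h


# Hypothetical citation counts for two imaginary scholars.
medic = [45, 30, 22, 10, 8, 6, 3]
mathematician = [12, 9, 4, 2, 1]

print(h_index(medic))          # 6
print(h_index(mathematician))  # 3
```

Note that the only input is a list of citation counts: field, career stage, author position and any non-article output never enter the calculation.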

So why, according to Konkiel, should we ‘stop caring so much’ about it?

Firstly, Konkiel likens comparison via the h-index to ‘comparing apples and oranges.’
The h-index does not consider an author’s field of study. This makes it difficult to say what counts as a ‘good’ h-index across domains: an author in medicine, for example, may have a much higher index than an author in mathematics, not necessarily because they are a better scholar, but simply because medical works may be published or cited more.

Furthermore, the h-index does not differentiate according to age or career stage. A younger scholar will most likely have published fewer papers than one further along their career path, yet this is not taken into account. An author may even have more than one h-index, depending on the database consulted.

Secondly, Konkiel highlights that the h-index ignores work that isn’t ‘shaped like an article.’ The h-index only accounts for academic articles; blog posts, patents, software and even some books are therefore omitted from the count, which, the author notes, affects the h-index in fields such as chemistry.
The sphere of influence it analyses is also limited to academic citations, so even if an article had great social implications or forced a change in policy, this is not taken into account.

The author goes on to question the validity of assessing an author by a single number. One figure is too narrow a measure to capture a scholar’s full prominence, and she draws parallels with the limited accuracy of journal impact factors. Whilst it may provide valid information in one area, a measure that balances only quantity against citation influence does not give a full picture of an author.

Finally, Konkiel goes as far as to suggest the h-index is ‘dumb’ when it comes to authorship, as it does not consider different weighting depending on whether a paper was written alone or collaboratively, or the position of an author in a collaboration.

In conclusion, the author details some of the attempted ‘fixes’ for the h-index, none of which have been widely implemented.

Therefore she suggests looking at alternatives, which base their rankings on a wider range of data, for example altmetrics.

You can find out more about altmetrics here