Why you should manage your research data

JISC is publishing a new guide: “How and why you should manage your research data: a guide for researchers. An introduction to engaging with research data management processes.”
JISC is the UK higher, further education and skills sectors’ not-for-profit organisation for digital services and solutions.
The guide provides an introduction to engaging with research data management processes and is aimed at researchers and research data management support staff.

According to JISC, most of the activities involved are: naming files so you can find them quickly; keeping track of different versions and deleting those no longer needed; backing up valuable data; and controlling who has access to your data (a simple naming and versioning convention is sketched below).
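
For illustration only (the JISC guide does not prescribe any particular convention), here is a minimal Python sketch of a dated, versioned file-naming helper; the project and file names are hypothetical:

```python
from datetime import date
from pathlib import Path

def versioned_name(project: str, description: str, version: int, ext: str = "csv") -> str:
    """Build a descriptive, sortable file name, e.g.
    'survey-cleaned-2014-06-02-v03.csv' (hypothetical example)."""
    stamp = date.today().isoformat()
    return f"{project}-{description}-{stamp}-v{version:02d}.{ext}"

def next_version(folder: Path, project: str, description: str, ext: str = "csv") -> int:
    """Count existing versions of a file so the next save gets a fresh number."""
    existing = list(folder.glob(f"{project}-{description}-*-v*.{ext}"))
    return len(existing) + 1

if __name__ == "__main__":
    data_dir = Path("data")
    data_dir.mkdir(exist_ok=True)
    v = next_version(data_dir, "survey", "cleaned")
    print(versioned_name("survey", "cleaned", v))
```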

[Figure: research data lifecycle diagram]

In addition to this very interesting and useful guide, JISC recommends the following online training programmes, most of which originated from Jisc funding:

  • Mantra – a free online course designed for researchers or others who manage digital data as part of a research project
  • TraD – includes a blended learning course for those in (or expecting to be in) research data management support roles
  • RDMRose – an open educational resource for information professionals on research data management

Three ways to use WRDS

Did you know? WRDS offers three ways of accessing data: on the website using form-based queries, through a UNIX terminal session, or using PC-SAS on your desktop computer. Each method has its own strengths.

WRDS provides a common interface to a variety of databases in order to make extracting data simpler for you. Read more on the WRDS website and learn how to get the most out of the WRDS access modes.
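
WRDS data can also be queried programmatically. The sketch below is illustrative only: it assumes the third-party `wrds` Python module and a valid WRDS account, and the CRSP query is just an example rather than a recommendation of any particular data set:

```python
# Minimal sketch of programmatic WRDS access (assumes the third-party
# `wrds` Python module and a valid WRDS account; query is only an example).
import wrds

# Opens a connection; you will be prompted for your WRDS credentials.
conn = wrds.Connection()

# Pull a small sample of monthly stock data from CRSP as an illustration.
monthly = conn.raw_sql("""
    select permno, date, ret, prc
    from crsp.msf
    where date between '2013-01-01' and '2013-12-31'
    limit 10
""")

print(monthly.head())
conn.close()
```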

To learn more about the data sets the library subscribes to, visit the library website.

Managing and sharing your research data

The LSE Blog has published a review of the book “Managing and Sharing Research Data: A Guide to Good Practice” by Louise Corti, Veerle Van den Eynden, Libby Bishop & Matthew Woollard.

Emily Grundy, the author of the review, says that the guide can be useful for students and researchers because data sharing has become much easier thanks to technical advances that have simplified acquiring and processing data. However, the whole business of depositing, managing, acquiring and using data responsibly has become more complicated, not least because of the wealth of data sets of different types now available.

Chapter 1 sets out the case for managing and sharing research data. The authors argue that published findings should be replicable, and publicly funded research should be regarded as a common good. Moreover, duplicating data collection exercises is wasteful and may also tend to increase response burden (if people are asked to participate in multiple surveys) and lead to lower response rates.

Subsequent chapters offer more guidance on planning for data sharing throughout the data life cycle, documentation, data management, formatting and storage. Chapter 7 considers in more detail the legal and ethical issues involved in sharing research. This includes a consideration of statistical disclosure techniques, an important and expanding area of research and practice that is often poorly understood. Suggestions are made about how to reduce the risk of individuals being identifiable in data sets, beyond removing personal identifiers. All of this is sensible, although such practices do limit the potential for further use in ways that may not be foreseen at the time of deposit.

Read more here

Library subscriptions on WRDS platform

The Library subscribes to Wharton Research Data Services (WRDS), a data research platform and business intelligence tool for academics. “Developed in 1993 to support faculty research at the Wharton School of the University of Pennsylvania, WRDS has since evolved to become the leading business intelligence tool for a global research community of 30,000+ users at over 350 institutions in 33 countries,” says the WRDS presentation on its website.

The library subscribes to various data sets including the following:
– Audit analytics data
– Bank Regulatory
– Blockholders
– Bureau van Dijk
– Capital IQ
– Center for Research in Security Prices (CRSP)
– CENTRIS-Monthly Insights Reports
– CBOE Indexes
– Compustat North America
– CUSIP Master File
– DMEF
– Dow Jones
– Eventus
– Execucomp
– Fama-French (research) Portfolio and Liquidity Factors
– FDIC
– I/B/E/S
– IRI Marketing Factbook: Point of Sale
– Penn World Tables
– People Intelligence
– Philadelphia Stock Exchange United Currency Options
– RiskMetrics
– SEC Disclosure of Order Execution
– TRACE

More information here

Wiley survey: How and why researchers share data

Earlier this year, Wiley conducted a survey on researchers’ views of data sharing. The publisher contacted 90,000 researchers across a wide array of disciplines and received more than 2,250 responses from individuals engaged in active research programs.

“First, of the 52% of respondents who said they had made their data publicly available, the largest proportion (67%) did so via supplementary material in journals.  That subset more or less tallies with the average take-up we see at Wiley of the supporting information facility (c. 30%), but this varies considerably by discipline, and of course not everything in supporting information is data. Other ways in which researchers reported making data publicly available, such as in repositories (which are better suited to long term data management and preservation), are dwarfed by this proportion.”

See more on the Wiley Blog

‘Capital in the Twenty-First Century’ sparks a debate

French economist Thomas Piketty’s recent book Capital in the Twenty-First Century has been the subject of large-scale debate.

His book studies the global dynamics of income and wealth distribution since the 18th century in more than 20 countries, with particular case studies of the UK, the USA and France. It draws on historical data collected over the past 15 years by Piketty together with Atkinson, Saez, Postel-Vinay, Rosenthal, Alvaredo, Zucman and over thirty other economists.
One of the main ideas presented in the book is that inequality is in fact a feature of capitalism that can only be reversed through state intervention. Piketty suggests that the history of income and wealth inequality is always political, chaotic and unpredictable; it involves national identities and sharp reversals, and therefore nobody can predict the reversals of the future. He argues that the trend towards greater inequality was reduced by major historical events such as the world wars, which forced governments to intervene to redistribute wealth. He speaks of a shift back to the ‘patrimonial capitalism’ of inherited wealth, predicts low economic growth in the near future despite technological advancement, and proposes a new wealth tax to combat this.

This book has been both praised and criticised by the academic community.

For example, Nobel Prize-winning economist Paul Krugman called the book a “magnificent, sweeping meditation on inequality” and “the most important economics book of the year — and maybe of the decade.”
Steven Pearlstein called it a “triumph of economic history over the theoretical, mathematical modeling that has come to dominate the economics profession in recent years.”

However, a large chunk of the criticism concerns Piketty placing inequality at the center of his analysis without reflecting on why it matters or explaining its implications. Martin Wolf suggests that he merely assumes inequality matters but never explains why, only demonstrating that it exists and how it worsens. Clive Crook also underlines how, “Aside from its other flaws, ‘Capital in the 21st Century’ invites readers to believe not just that inequality is important but that nothing else matters. This book wants you to worry about low growth in the coming decades not because that would mean a slower rise in living standards, but because it might … worsen inequality.”

There has also been heavy criticism of Piketty’s methodology. Lawrence Summers claims he underestimated diminishing returns on capital, which would change the upper limits of inequality implied by his model. James K. Galbraith criticizes Piketty for using “an empirical measure that is unrelated to productive physical capital and whose dollar value depends, in part, on the return on capital. Where does the rate of return come from? Piketty never says,” and German economist Stefan Homburg criticizes Piketty for equating “wealth” with “capital”.

Much of the focus has been not only on lexical differences but also on qualitative versus quantitative methods. Inequality itself can carry both qualitative and quantitative connotations, as it does not rely solely on numerical wealth. Being such a central factor, it needs to be strictly defined.
There are also claims that the quantitative data, gathered from a variety of sources and by the array of researchers required to obtain the necessary figures across the studied period, were combined in a questionable manner, which would affect the results produced by the economic model.

A further list of reviews and criticisms can be found here

So whilst Piketty’s book is groundbreaking, it does not come without controversy.

Make up your own mind – you can find the book at the library, under the call number 2-2434 PIK.

On The Future Of Statistical Languages

Seth Brown, a data scientist in the telecommunications industry, has recently written an article on his blog entitled ‘On The Future Of Statistical Languages’.

The article analyses the current state of the statistical languages in use and offers a reasoned prediction of where the field is heading.

There is currently a crossover between programming and statistical languages, with most statistical languages offering limited programming functionality and vice versa. For this area to develop further, bridges need to be built to improve the crossover and make each side less exclusive, with a view to creating ‘an efficient, modern data analysis workflow’.

Taking the author’s example, using languages such as ‘R’ and ‘Python’, and having to switch between them to complete different tasks within the same project, is inefficient. Whilst not intending to critique the current languages on offer, the author goes on to advocate a ‘rich data analysis API on top of a more general open source programming language’.

But what does this mean for the future?

New inventions, new technologies and increased reliance on digital devices mean that the amount of data collected is growing exponentially. To this end, the author proposes that statistical languages focus on the statistical side and leave ‘the nuts and bolts’ of language design to its own experts. The language needs to be easy to understand and approachable for the students, statisticians and scientists using it, so that they can concentrate on the data and methodology instead of on the language itself.

It should also be free, unlike current commercial examples such as MATLAB, SPSS and Stata, to prevent monopolization, ensure that data can be shared across platforms and allow research to advance, avoiding what the author calls ‘The Microsoft Word problem’.

New tools should not be limited by domain-specific languages or sunk-cost projects – as the Haskell case has proved.

The author then suggests that, going forward, ‘Python is the most obvious choice’, as it is already widely used and already has tools in place. But even Python needs improvement in terms of accessibility for non-programmers, a more user-friendly environment and deeper reach into academia, in order to drive the construction of the programming-statistical bridge.
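
As an illustration of the kind of “statistics on top of a general-purpose language” workflow the author has in mind, here is a minimal Python sketch; it assumes the numpy, pandas and statsmodels libraries, and the data are invented:

```python
# Minimal sketch of a Python data-analysis workflow (assumes numpy,
# pandas and statsmodels; the data below are invented for illustration).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate a small data set: hours studied vs. exam score.
rng = np.random.default_rng(0)
df = pd.DataFrame({"hours": rng.uniform(0, 10, 100)})
df["score"] = 50 + 4 * df["hours"] + rng.normal(0, 5, 100)

# General-purpose language features (data wrangling) ...
df["passed"] = df["score"] >= 70

# ... sit next to statistical modelling in the same script.
model = smf.ols("score ~ hours", data=df).fit()
print(model.summary())
print("Pass rate:", df["passed"].mean())
```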

‘On The Future Of Statistical Languages’ / Seth Brown. 18th December 2013. On the blog “Dr. Bunsen”.

General rules and basic information about data citation

In the same way that you cite journal articles and books you reference in your publication, you may also need to cite any data your publication uses.

Citing data sets (spreadsheets, etc.) is necessary to provide context for your research and to give credit to the data’s creators.

Some style guides provide instructions for the citation of data, but if you can’t find a list of general rules, then consider these few elements when building your data citation:

Author: Creator of the data set (individual, group of individuals, organization).

Title: Title of the data set or name of the study.

Edition or Version: Version or edition number associated with the data set.

Date: Year of data publication.

Editor: Person or team responsible for compiling or editing the data set.

Publisher (= distributor): Entity (and location) responsible for producing and/or distributing the data set.

Producer: Organization that sponsored the author’s research and/or organization that made the creation of the data set possible, such as codifying and digitizing the data.

Material: Computer file or online article.

Electronic Retrieval Location: Web address where the data set is available including persistent identifier like DOI.

Examples using these General Rules:

APA (6th edition)

Smith, T.W., Marsden, P.V., & Hout, M. (2011). General social survey, 1972-2010 cumulative file (ICPSR31521-v1) [data file and codebook]. Chicago, IL: National Opinion Research Center [producer]. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor]. doi: 10.3886/ICPSR31521.v1

MLA (7th edition)

Smith, Tom W., Peter V. Marsden, and Michael Hout. General Social Survey, 1972-2010 Cumulative File. ICPSR31521-v1. Chicago, IL: National Opinion Research Center [producer]. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2011. Web. 23 Jan 2012. doi:10.3886/ICPSR31521.v1

Chicago (16th edition) (author-date)

Smith, Tom W., Peter V. Marsden, and Michael Hout. 2011. General Social Survey, 1972-2010 Cumulative File. ICPSR31521-v1. Chicago, IL: National Opinion Research Center. Distributed by Ann Arbor, MI: Inter-university Consortium for Political and Social Research. doi:10.3886/ICPSR31521.v1
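
If you have many data sets to cite, the general elements listed earlier can also be assembled programmatically. The following Python sketch is purely illustrative (the function and field names are hypothetical); it approximates the APA example above:

```python
# Illustrative sketch: assembling an APA-style data citation from the
# general elements listed above (function and field names are hypothetical).
def apa_data_citation(authors, year, title, version, material,
                      producer, distributor, doi):
    return (f"{authors} ({year}). {title} ({version}) [{material}]. "
            f"{producer} [producer]. {distributor} [distributor]. doi: {doi}")

print(apa_data_citation(
    authors="Smith, T.W., Marsden, P.V., & Hout, M.",
    year=2011,
    title="General social survey, 1972-2010 cumulative file",
    version="ICPSR31521-v1",
    material="data file and codebook",
    producer="Chicago, IL: National Opinion Research Center",
    distributor="Ann Arbor, MI: Inter-university Consortium for Political and Social Research",
    doi="10.3886/ICPSR31521.v1",
))
```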

Also note that versions X5 and above of EndNote have a template for the ‘dataset’ reference type.

See the EndNote manual for information on how to use this reference type if you are unsure.

The library will be happy to help with data set citation.