Methodology - Global Open Data Index: Survey

The Global Open Data Index collects and presents information on the current state of open data release around the world. The Global Open Data Index is run by Open Knowledge International with the assistance of volunteers from the Open Knowledge Network around the world. The first Open Data Index was released on October 28, 2013. This page explains the methodology behind the Global Open Data Index. If you have any further questions or comments about our methodology please reach out to the staff, community of volunteers, and Index reviewers on the Open Data Index forum.

The Global Open Data Index is not an official government representation of the open data offering in each country, but an independent assessment from a citizen’s perspective. It is a civil society audit of open data that enables citizens and governments to measure a government’s progress on open data. The Index gives both parties a measurement tool and a baseline for discussion and analysis of the open data ecosystem in their country and internationally. The datasets taken into account seek to represent civil society’s preferences and therefore measure open data publication from a key user’s perspective (for further details, see the Data Categories section below).

The Global Open Data Index is not only a benchmarking tool; it also plays a powerful role in sustaining momentum for open data around the world and in convening civil society networks to use and collaborate around this data. If, for example, the government of a country publishes an open dataset, but this is not clear to the public and the dataset cannot be found through a simple search, then the data can easily be overlooked and not put to good use. Governments and open data practitioners can review the Index results to see how accessible the open data they publish actually appears to their citizens, see where improvements are necessary to make open data truly open and useful, and track their progress from year to year.

The research question

Like any other benchmarking tool, the Global Open Data Index tries to answer a question. In our case, the question is as follows:

“What is the state of open data around the world?”

From this question, other important questions emerge, such as:

“Which country ranks best on open data? Who is the least/most open country?”
“What is the most open dataset? What is the least open dataset?”
“Are some data more readily published as open data than others?”

Open data has two key aspects: legal and technical openness. Which of these two, and which specific requirement (e.g. an open license, machine readability, bulk access), is the most challenging for data publishers? For example, do governments find it easy to publish machine-readable data but struggle to apply an open license?

According to the common open data assessment framework, there are four different ways to evaluate data openness: context, data, use and impact. The Global Open Data Index intentionally limits its inquiry to the publication of datasets by national governments. It does not look at the broader societal context (for example, the legal or policy framework, such as FOI laws), and it does not seek to assess use or impact in a systematic way.

In contrast to past editions, the Index now also seeks to capture information on practical openness, i.e. data findability and usability. These questions are not currently scored, but the answers will be valuable to both governments and users.

The scored Open Data Index questions do not assess the quality of the data. This narrow focus on data publication enables the Index to provide a standardized, robust, comparable assessment of the state of the publication of key data by governments around the world. We are nevertheless aware that data quality is a key concern of the open data community and a significant barrier to reuse.

Research assumptions

Different countries have different governance structures (federal vs. unitary national government, etc.) and different policies regarding open data. We set out here the key assumptions that inform our approach and that were taken into consideration while collecting and assessing the data.

Assumption 1: Open Data is defined by the Open Definition

We define open data according to the Open Definition, a set of principles that define openness in relation to data and content. It is the original, “gold-standard” definition for open data. It is also simple and easy to operationalise. We note one small deviation from the current v2.1 of the Open Definition: the only part of our methodology that is not aligned with it is the “open machine-readable format” requirement. To emphasise practical openness, we give a full score to machine-readable formats whose source code is not open, but which can be used with at least one free and open-source software tool.

Assumption 2: The role of government in publishing data

In the past, there have been questions in the Index community about the role of the government in ensuring the publication of a specific dataset. In many fields, some government services are privatised, which means the data is owned and produced by a company and not the state. Our assumption is that for the key datasets we survey, the government has a responsibility to ensure the availability of such data even if it is held and managed by a third party.

Assumption 3: The Global Open Data Index is a national indicator

We recognise that not all countries have the same governance structure and that the data assessed through the Index might not necessarily be produced by the national government, due to decentralisation of power. Furthermore, it is possible that not all sub-national governments produce the same data, as they are potentially subject to different laws and/or procedures. Nevertheless, the Global Open Data Index assesses national governments by measuring the publication of open data at the country level. Country-level data assessed here may take three forms:

The data describes national government processes or procedures
The data is collected or produced by national government or a national government agency
The data describes national parameters and public services for the entire national territory, but is collected by sub-national governments.

Data Categories

Dataset definitions are crucial in enabling respondents to assess datasets accurately and to do so in a way that is comparable across countries. Each year we refine our definitions to reflect learnings from experts in the field. The data definitions do the following:

Describe the dataset by at least 3 key data characteristics it must have.
Include a time interval for how often the dataset needs to be updated. We use the “Is this timely” question in the Index survey to assess whether data is published in a timely fashion. However, different datasets are reasonably updated at different intervals. Adding this characteristic to the dataset definition can help users answer this question.
Mention which aggregation level the data needs to be in. Some datasets can be in more than one aggregation level, and mentioning the aggregation level can help to avoid confusion between datasets.
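The three parts of a dataset definition can be illustrated with a small data structure. The following Python sketch is only an illustration and not the Index's own tooling: the class name `DatasetDefinition`, its field names, and the choice of 31 days to represent "at least once a month" are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional, Tuple


@dataclass(frozen=True)
class DatasetDefinition:
    """Hypothetical container for one dataset definition."""

    name: str
    characteristics: Tuple[str, ...]  # key characteristics the data must have
    update_interval: timedelta        # maximum age before data counts as outdated
    aggregation_level: str            # e.g. "national"

    def __post_init__(self) -> None:
        # Mirrors the rule that a definition names at least 3 characteristics.
        if len(self.characteristics) < 3:
            raise ValueError("a definition needs at least 3 key characteristics")

    def is_timely(self, last_updated: Optional[date], assessed_on: date) -> bool:
        """Return True if the data is recent enough on the assessment date."""
        # Assumption: an unknown update date is treated as "not timely".
        if last_updated is None:
            return False
        return assessed_on - last_updated <= self.update_interval


# Example based on the Company Register category; 31 days is an assumed
# stand-in for "updated at least once a month".
company_register = DatasetDefinition(
    name="Company Register",
    characteristics=(
        "Name of company",
        "Unique identifier of the company",
        "Company address",
    ),
    update_interval=timedelta(days=31),
    aggregation_level="national",
)
```

With this structure, a reviewer could call `company_register.is_timely(last_updated, assessed_on)` with the date found on the publisher's site and the date of the assessment.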

Data Categories

National Statistics: Key national statistics such as demographic and economic indicators (GDP, unemployment, population, etc). To satisfy this category, the following minimum criteria must be met:

GDP for the whole country updated at least quarterly
Unemployment statistics updated at least monthly
Population updated at least once a year

Government Budget: National government budget at a high level. This category is looking at budgets, or the planned government expenditure for the upcoming year, and not the actual expenditure. To satisfy this category, the following minimum criteria must be met:

Planned budget divided by government department and sub-department
Updated once a year.
The budget should include descriptions regarding the different budget sections.

Government Spending: Records of actual (past) national government spending at a detailed transactional level. A database of contracts awarded or similar will not be considered sufficient. This data category refers to detailed ongoing data on actual expenditure. Data submitted in this category should meet the following minimum criteria:

Individual record of transactions
Date of the transactions
Government office that made the transaction
Name of vendor
Amount of the transaction
Updated on a monthly basis

Draft Legislation: Data about the bills discussed within the national parliament as well as votes on bills (not to be confused with passed national law). Data on bills must be available for the current legislative period and must include the following:

Content of bill
Author of bill
Votes on bill per member of parliament
Transcripts of debates on bill
Status of the bill

National Laws: This data category requires all national laws and statutes to be available online, although it is not a requirement that information on legislative behaviour (e.g. voting records) is available. To satisfy this category, the following minimum criteria must be met:

Content of the law / statutes
If applicable, all relevant amendments to the law
Date of last amendments
Data should be updated at least quarterly

Election Results: This data category requires results by constituency / district for all major national electoral contests. To satisfy this category, the following minimum criteria must be met:

Result for all major electoral contests
Number of registered votes
Number of invalid votes
Number of spoiled ballots
All data should be reported at the level of the polling station

National Map: This data category requires a high-level national map. To satisfy this category, the following minimum criteria must be met:

Scale of 1:250,000 (1 cm = 2.5km).
Markings of national traffic routes
Markings of water stretches
Markings of relief/heights
National borders
Updated at least once a year.

Pollutant Emissions: Data about the daily mean concentration of air pollutants, especially those potentially harmful to human health. Data should be available for all air monitoring stations or air monitoring zones in a country. In order to satisfy the minimum requirements for this category, data must be available for the following pollutants and meet the following minimum criteria:

Particulate matter (PM) Levels
Sulfur oxides (SOx)
Nitrogen oxides (NOx)
Volatile organic compounds (VOCs)
Carbon monoxide (CO)
Ozone.
Available per air monitoring station/zone

Company Register: List of registered (limited liability) companies. The submissions in this data category do not need to include detailed financial data such as balance sheet, etc. To satisfy this category, the following minimum criteria must be met:

Name of company
Unique identifier of the company
Company address
Updated at least once a month

Location datasets: A database of postcodes/zipcodes and the corresponding spatial locations in terms of a latitude and a longitude (or similar coordinates in an openly published coordinate system). The data has to be available for the entire country. Data submitted in this category must satisfy the following minimum conditions:

Zipcodes
Address
Coordinate (latitude & longitude)
Data available for entire country

Administrative boundaries: Data on administrative units or areas defined for the purpose of administration by a (local) government.

Boundary level 1
Boundary level 2
Coordinates (latitude & longitude)
Name of polygons (department, region, city)
Borders of polygons

Procurement: All tenders and awards of the national/federal government aggregated by office. Monitoring tenders can help new groups to participate in tenders and increase government compliance. Data submitted in this category must be aggregated by office, updated at least monthly, and satisfy the following minimum criteria:

Tenders per government office
Awards per government office
Tender name
Tender description
Tender status
Award title
Award description
Value of the award
Supplier's name

Water Quality: Data on the quality of water, measured at the water source, is essential for both the delivery of services and the prevention of diseases. In order to satisfy the minimum requirements for this category, data should be available on the levels of the following parameters by water source and be updated at least weekly:

Fecal coliform
Arsenic
Fluoride levels
Nitrates
TDS (Total dissolved solids)

Weather Forecast: 5-day forecasts of temperature, precipitation and wind. Forecasts have to be provided for several regions in the country. In order to satisfy the minimum requirements for this category, data submitted should meet the following criteria:

Temperature extremes
Temperature average
Wind speed
Wind direction
Precipitation Amount
Precipitation Probability
Forecast for current day and four following days

Land Ownership: Data should include maps of land with a parcel layer that displays boundaries, in addition to a land registry with information on registered parcels of land. The following characteristics must be included in the cadastral and registry information submitted:

Parcel Boundaries
Parcel ID
Property Value
Tenure Type

Place

In a few cases, we have received submissions for places that are not officially recognised as independent countries; we have included these if they are complete and accurate submissions. Therefore, the Global Open Data Index 2016 ranks ‘Places’ and not ‘Countries’. Generally, we seek to survey jurisdictions with sufficient autonomy to be responsible for data management and publication. Usually these are countries; however, there are cases where country jurisdiction is disputed and we generally seek to be flexible and inclusive where we can.

Scoring

Each dataset in each place is evaluated using a set of questions that examine the openness of the dataset based on the Open Definition and the Open Data Charter.

In 2016, we introduced the new survey of the Global Open Data Index (GODI). The new scoring follows two major ideas:

We assume that each question of our survey measures a crucial characteristic of the legal, technical or practical ‘openness’ of data. Our scoring follows an assessment of the weighting (see below) in which we describe why a question is important for open data and how a score can reflect this importance. We also explain the cases in which a question should not be scored. With this approach, we aim to reduce the potential bias towards single aspects of openness.
The new scoring gives a total of 40 points to open licenses/public domain status and machine-readable and open file formats. These technical and legal aspects of openness are the core of the Open Definition 2.1 and we seek to maintain a strong emphasis on them. However, aspects such as timely publication, data availability and accessibility are equally important for accessing and using open data. Questions around data accessibility receive a total of 60 points.

Questions & Scoring

Section A: Background Information (Not Scored)

Rate your knowledge of the data category
Rate your knowledge of the principles of open data

Section B: About the Data (Scored)

Question	Description	Scoring
Are the data available online without the need to register or request access to the data?	Answer “Yes”, if the data are made available by the government on a public website. Answer “No” if the data are NOT available online or are available online only after registering, requesting the data from a civil servant via email, completing a contact form or another similar administrative process.	Score: 15
Is the data available free of charge?	The data is free if you don’t have to pay for it.	Score: 15
Is the data downloadable all at once?	Answer “Yes” if you can download all data at once from the URL at which you found them. If downloadable data files are very large, their downloads may also be organised by month or year or broken down into sub-files. Answer “No” if you have to do many manual steps to download the data, or if you can only retrieve very few parts of a large dataset at a time (for instance through a search interface).	Score: 15
Data should be updated every [Time Interval]: Is the data up-to-date?	Please base your answer on the date at which you answer this question. Answer “No” if you cannot determine a date, or if the data are outdated.	Score: 15
Is the data openly licensed/in public domain?	This question measures whether anyone is legally allowed to use, modify and redistribute data for any purpose. Only then are data considered truly "open" (see Open Definition). Answer “Yes” if the data are openly licensed. The Open Definition provides a list of conformant licenses. Also answer “Yes” if there is no open licence, but a statement that the dataset is in the “public domain”. To count as public domain, the dataset must not be protected by copyright, patents or similar restrictions. If you are not sure whether an open licence or public domain disclaimer is compliant with the Open Definition 2.1, seek feedback on the Open Data Index discussion forum.	Score: 20
In which formats are the data?	Tell us the file formats of the data. We automatically compare them against a list of file formats that are considered machine-readable and open. A file format is called machine-readable if your computer can process, access, and modify single elements in a data file. The Index considers formats to be “open” if they can be fully processed with at least one free and open-source software tool. The source code of these formats does not have to be open. Potentially these formats allow more people to use the data, because people do not need to buy specific software to open it.	Score: 20
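The weights in the table above add up to 100 points: 60 for the four accessibility questions and 40 for licensing and format. As a rough illustration of how a single dataset's score could be computed from them, here is a minimal Python sketch. It assumes that every question is answered with a plain yes/no and scored all-or-nothing, which may be simpler than the Index's actual handling (in particular for the file format question). The dictionary keys and the function name `score_submission` are hypothetical labels, not identifiers used by the Index.

```python
from typing import Dict

# Weights copied from the "Scoring" column of the Section B table.
# The short keys are hypothetical labels chosen for this sketch.
WEIGHTS: Dict[str, int] = {
    "available_online_without_registration": 15,
    "free_of_charge": 15,
    "downloadable_at_once": 15,
    "up_to_date": 15,
    "openly_licensed": 20,
    "open_machine_readable_format": 20,
}


def score_submission(answers: Dict[str, bool]) -> int:
    """Sum the weights of every question answered "Yes" (True)."""
    unknown = set(answers) - set(WEIGHTS)
    if unknown:
        raise KeyError(f"unknown question keys: {sorted(unknown)}")
    # Assumption: a missing answer is treated as "No".
    return sum(
        weight for key, weight in WEIGHTS.items() if answers.get(key, False)
    )
```

Under these assumptions, a dataset that is online, free, downloadable in bulk and up to date, but published without an open licence and only in a non-open format, would receive the 60 accessibility points and none of the 40 points for legal and technical openness.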

Section B: About the Data (Not Scored)

Sample methodology

The Index uses a non-probability sampling technique known as a “snowball sample”. A snowball sample is used to reach study subjects that are hard to locate. In our case, we work with contributors who are interested in open government data and who can assess the availability and quality of open datasets in their respective locations. We find them not only through referrals, but also by reaching out on social media, through regular communications on our Open Government Data and Open Data Index forums, and by actively networking at conferences and events. This year, we also hired local coordinators, who reached out to their networks and assisted in soliciting new submissions.

This means that anyone from any place can participate in the Global Open Data Index as a contributor and make submissions, which are then reviewed. We do not have a quota on the number of places that can participate. Rather, we aim to sample as many places around the world as we can. This also has an impact on the quality of the data we collect in the first stage of the Global Open Data Index. Contributors have diverse knowledge and backgrounds in open data, and therefore they sometimes need help finding the data we are looking for. The following section explains how we tried to deal with this problem.