Microsoft Research Open Data


Microsoft Research Open Data: “… is a data repository that makes available datasets that researchers at Microsoft have created and published in conjunction with their research. You can browse available datasets and either download them or directly copy them to an Azure-based Virtual Machine or Data Science Virtual Machine. To the extent possible, we follow FAIR (findable, accessible, interoperable and reusable) data principles and will continue to push towards the highest standards for data sharing. We recognize that there are dozens of data repositories already in use by researchers and expect that the capabilities of this repository will augment existing efforts. Datasets are categorized by their primary research area. You can find links to research projects or publications with the dataset.

What is our goal?

Our goal is to provide a simple platform for Microsoft’s researchers and collaborators to share datasets and related research technologies and tools. The site has been designed to simplify access to these datasets, facilitate collaboration between researchers using cloud-based resources, and enable the reproducibility of research. We will continue to evolve and grow this repository and add features to it based on feedback from the community.

How did this project come to be?

Over the past few years, our team, based at Microsoft Research, has worked extensively with the research community to create cloud-based research infrastructure. We started this project as a prototype about a year ago and are excited to finally share it with the research community to support data-intensive research in the cloud. Because almost all research projects have a data component, there is a real need for curated and meaningful datasets in the research community, not only in computer science but in interdisciplinary and domain sciences. We have now made several such datasets available for download or use directly on cloud infrastructure….(More)”.

The Global Council on Extended Intelligence


“The IEEE Standards Association (IEEE-SA) and the MIT Media Lab are joining forces to launch a global Council on Extended Intelligence (CXI) composed of individuals who agree on the following:

One of the most powerful narratives of modern times is the story of scientific and technological progress. While our future will undoubtedly be shaped by the use of existing and emerging technologies – in particular, of autonomous and intelligent systems (A/IS) – there is no guarantee that progress defined by “the next” is beneficial. Growth for humanity’s future should not be defined by reductionist ideas of speed or size alone but as the holistic evolution of our species in positive alignment with the environmental and other systems comprising the modern algorithmic world.

We believe all systems must be responsibly created to best utilize science and technology for tangible social and ethical progress. Individuals, businesses and communities involved in the development and deployment of autonomous and intelligent technologies should mitigate predictable risks at the inception and design phase and not as an afterthought. This will help ensure these systems are created in such a way that their outcomes are beneficial to society, culture and the environment.

Autonomous and intelligent technologies also need to be created via participatory design, where systems thinking can help us avoid repeating past failures stemming from attempts to control and govern the complex-adaptive systems we are part of. Responsible living with or in the systems we are part of requires an awareness of the constrictive paradigms we operate in today. Our future practices will be shaped by our individual and collective imaginations and by the stories we tell about who we are and what we desire, for ourselves and the societies in which we live.

These stories must move beyond the “us versus them” media mentality pitting humans against machines. Autonomous and intelligent technologies have the potential to enhance our personal and social skills; they are much more fully integrated and less discrete than the term “artificial intelligence” implies. And while this process may enlarge our cognitive intelligence or make certain individuals or groups more powerful, it does not necessarily make our systems more stable or socially beneficial.

We cannot create sound governance for autonomous and intelligent systems in the Algorithmic Age while utilizing reductionist methodologies. By proliferating the ideals of responsible participant design, data symmetry and metrics of economic prosperity prioritizing people and the planet over profit and productivity, The Council on Extended Intelligence will work to transform reductionist thinking of the past to prepare for a flourishing future.

Three Priority Areas to Fulfill Our Vision

1 – Build a new narrative for intelligent and autonomous technologies inspired by principles of systems dynamics and design.

“Extended Intelligence” is based on the hypothesis that intelligence, ideas, analysis and action are not formed in any one individual collection of neurons or code…..

2 – Reclaim our digital identity in the algorithmic age

Business models based on tracking behavior and using outdated modes of consent are compounded by the appetites of states, industries and agencies for all data that may be gathered….

3 – Rethink our metrics for success

Although very widely used, concepts of exponential growth and productivity such as the gross domestic product (GDP) index are insufficient to holistically measure societal prosperity. … (More)”.

The Role of Behavioral Economics in Evidence-Based Policymaking


William J. Congdon and Maya Shankar in Special Issue of The ANNALS of the American Academy of Political and Social Science on Evidence Based Policy Making: “Behavioral economics has come to play an important role in evidence-based policymaking. In September 2015, President Obama signed an executive order directing federal agencies to incorporate insights from behavioral science into federal policies and programs. The order also charged the White House Social and Behavioral Sciences Team (SBST) with supporting this directive. In this article, we briefly trace the history of behavioral economics in public policy. We then turn to a discussion of what the SBST was, how it was built, and the lessons we draw from its experience and achievements. We conclude with a discussion of prospects for the future, arguing that even as SBST is currently lying fallow, behavioral economics continues to gain currency and show promise as an essential element of evidence-based policy….(More)”.

Mapping Puerto Rico’s Hurricane Migration With Mobile Phone Data


Martin Echenique and Luis Melgar at CityLab: “It is well known that the U.S. Census Bureau keeps track of state-to-state migration flows. But that’s not the case with Puerto Rico. Most of the publicly known numbers related to the post-Maria diaspora from the island to the continental U.S. were driven by estimates, and neither state nor federal institutions kept track of how many Puerto Ricans had left (or returned) after the storm ravaged the entire territory last September.

But Teralytics, a New York-based tech company with offices in Zurich and Singapore, has developed a map that reflects exactly how, when, and where Puerto Ricans have moved between August 2017 and February 2018. They did it by tracking data that was harvested from a sample of nearly 500,000 smartphones in partnership with one major undisclosed U.S. cell phone carrier….
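
To make the approach concrete, here is a minimal sketch, assuming hypothetical field names and a toy sample (this is not Teralytics’ actual pipeline), of how anonymized location pings can be rolled up into monthly origin-destination flows:

```python
import pandas as pd

# Toy ping records: user_id, timestamp, region are all hypothetical fields.
pings = pd.DataFrame({
    "user_id":   ["a", "a", "a", "b", "b"],
    "timestamp": pd.to_datetime(["2017-08-05", "2017-09-28", "2017-10-12",
                                 "2017-08-20", "2017-11-02"]),
    "region":    ["PR", "PR", "FL", "PR", "NY"],
})

# Assign each user a "home" region per month: the modal region of their pings.
pings["month"] = pings["timestamp"].dt.to_period("M")
home = (pings.groupby(["user_id", "month"])["region"]
             .agg(lambda s: s.mode().iloc[0])
             .reset_index(name="home"))

# A move is a change in home region between consecutive observed months.
home = home.sort_values(["user_id", "month"])
home["prev_home"] = home.groupby("user_id")["home"].shift()
moves = home[home["prev_home"].notna() & (home["home"] != home["prev_home"])]
flows = moves.groupby(["prev_home", "home"]).size().rename("movers")
print(flows)
```

At scale, the same logic (assign each device a modal home region per month, then count month-over-month changes) produces the kind of flow map described above.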

The usefulness of this kind of geo-referenced data is clear in disaster relief efforts, especially when it comes to developing accurate emergency planning and determining when and where the affected population is moving.

“Generally speaking, people have their phones with them the entire time. This tells you where people are, where they’re going to, coming from, and movement patterns,” said Steven Bellovin, a computer science professor at Columbia University and former chief technologist for the U.S. Federal Trade Commission. “It could be very useful for disaster-relief efforts.”…(More)”.

Preprints: The What, The Why, The How.


Center for Open Science: “The use of preprint servers by scholarly communities is definitely on the rise. Many developments in the past year indicate that preprints will be a huge part of the research landscape. Developments with DOIs, changes in funder expectations, and the launch of many new services indicate that preprints will become much more pervasive and reach beyond the communities where they started.

From funding agencies that want to realize impact from their efforts sooner to researchers’ desire to disseminate their research more quickly, the growth of these servers and of the number of works being shared has been substantial. At COS, we already host twenty different organizations’ services via the OSF Preprints platform.

So what’s a preprint and what is it good for? A preprint is a manuscript submitted to a dedicated repository (like OSF Preprints, PeerJ, bioRxiv, or arXiv) prior to peer review and formal publication. Some of those repositories may also accept other types of research outputs, like working papers and posters or conference proceedings. Getting a preprint out there has a variety of benefits for authors and other stakeholders in the research:

  • They increase the visibility of research, and sooner. While traditional papers can languish in the peer review process for months, even years, a preprint is live the minute it is submitted and moderated (if the service moderates). This means your work gets indexed by Google Scholar and Altmetric, and discovered by more relevant readers than ever before.
  • You can get feedback on your work and make improvements prior to journal submission. Many authors have publicly commented about the recommendations for improvements they’ve received on their preprint that strengthened their work and even led to finding new collaborators.
  • Papers with an accompanying preprint get cited 30% more often than papers without. This research from PeerJ sums it up, but that’s a big benefit for scholars looking to get more visibility and impact from their efforts.
  • Preprints get a permanent DOI, which makes them part of the freely accessible scientific record forever (see the metadata-lookup sketch after this list). This means others can rely on that permanence when citing your work in their research. It also means that your idea, developed by you, has a “stake in the ground” where potential scooping and intellectual theft are concerned.
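
Because preprint DOIs are registered with agencies such as Crossref and DataCite, anyone can retrieve a preprint’s bibliographic metadata programmatically. Here is a minimal sketch using standard DOI content negotiation; the DOI shown is a placeholder, not a real record:

```python
import requests

# Placeholder DOI for illustration; substitute a real preprint DOI.
doi = "10.31219/osf.io/xxxxx"

# doi.org supports content negotiation: requesting CSL JSON returns the
# bibliographic metadata registered for the object.
resp = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
    timeout=30,
)
resp.raise_for_status()
meta = resp.json()
print(meta.get("title"), meta.get("issued"), meta.get("URL"))
```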

So, preprints can really help lubricate scientific progress. But there are some things to keep in mind before you post. Usually, you can’t post a preprint of an article that’s already been submitted to a journal for peer review. Policies among journals vary widely, so it’s important to check with the journal you’re interested in sending your paper to BEFORE you submit a preprint that might later be published. A good resource for doing this is JISC’s SHERPA/RoMEO database. It’s also a good idea to understand the licensing choices available. At OSF Preprints, we recommend the CC-BY license suite, but you can check choosealicense.com or https://osf.io/6uupa/ for good overviews on how best to license your submissions….(More)”.

Data Ethics Framework


Introduction by Matt Hancock MP, Secretary of State for Digital, Culture, Media and Sport to the UK’s Data Ethics Framework: “Making better use of data offers huge benefits, in helping us provide the best possible services to the people we serve.

However, all new opportunities present new challenges. Technology is changing so fast that we need to make sure we are constantly adapting our codes and standards. Those of us in the public sector need to lead the way.

As we set out to develop our National Data Strategy, getting the ethics right, particularly in the delivery of public services, is critical. To do this, it is essential that we agree collective standards and ethical frameworks.

Ethics and innovation are not mutually exclusive. Thinking carefully about how we use our data can help us be better at innovating when we use it.

Our new Data Ethics Framework sets out clear principles for how data should be used in the public sector. It will help us maximise the value of data whilst also setting the highest standards for transparency and accountability when building or buying new data technology.

We have come a long way since we published the first version of the Data Science Ethical Framework. This new version focuses on the need for technology, policy and operational specialists to work together, so we can make the most of expertise from across disciplines.

We want to work with others to develop transparent standards for using new technology in the public sector, promoting innovation in a safe and ethical way.

This framework will build the confidence in public sector data use needed to underpin a strong digital economy. I am looking forward to working with all of you to put it into practice…. (More)”

The Data Ethics Framework principles

1. Start with clear user need and public benefit

2. Be aware of relevant legislation and codes of practice

3. Use data that is proportionate to the user need

4. Understand the limitations of the data

5. Ensure robust practices and work within your skillset

6. Make your work transparent and be accountable

7. Embed data use responsibly

The Slippery Math of Causation


Pradeep Mutalik for Quanta Magazine: “You often hear the admonition “correlation does not imply causation.” But what exactly is causation? Unlike correlation, which has a specific mathematical meaning, causation is a slippery concept that has been debated by philosophers for millennia. It seems to get conflated with our intuitions or preconceived notions about what it means to cause something to happen. One common-sense definition might be to say that causation is what connects one prior process or agent — the cause — with another process or state — the effect. This seems reasonable, except that it is useful only when the cause is a single factor, and the connection is clear. But reality is rarely so simple.

Although we tend to credit or blame things on a single major cause, in nature and in science there are almost always multiple factors that have to be exactly right for an event to take place. For example, we might attribute a forest fire to the carelessly thrown cigarette butt, but what about the grassy tract leading to the forest, the dryness of the vegetation, the direction of the wind and so on? All of these factors had to be exactly right for the fire to start. Even though many tossed cigarette butts don’t start fires, we zero in on human actions as causes, ignoring other possibilities, such as sparks from branches rubbing together or lightning strikes, or acts of omission, such as failing to trim the grassy path short of the forest. And we tend to focus on things that can be manipulated: We overlook the direction of the wind because it is not something we can control. Our scientifically incomplete intuitive model of causality is nevertheless very useful in practice, and helps us execute remedial actions when causes are clearly defined. In fact, artificial intelligence pioneer Judea Pearl has published a new book about why it is necessary to teach cause and effect to intelligent machines.

However, clearly defined causes may not always exist. Complex, interdependent multifactorial causes arise often in nature and therefore in science. Most scientific disciplines focus on different aspects of causality in a simplified manner. Physicists may talk about causal influences being unable to propagate faster than the speed of light, while evolutionary biologists may discuss proximate and ultimate causes as mentioned in our previous puzzle on triangulation and motion sickness. But such simple situations are rare, especially in biology and the so-called “softer” sciences. In the world of genetics, the complex multifactorial nature of causality was highlighted in a recent Quanta article by Veronique Greenwood that described the intertwined effects of genes.

One well-known approach to understanding causality is to separate it into two types: necessary and sufficient….(More)”
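
As a toy illustration of that distinction (this example is not from the article): a factor is necessary if the effect never occurs in its absence, and sufficient if the effect always occurs in its presence. Checking both conditions over a handful of hypothetical forest-fire scenarios makes the asymmetry explicit:

```python
# Each scenario records which factors were present and whether a fire occurred.
scenarios = [
    {"cigarette": True,  "dry": True,  "wind": True,  "fire": True},
    {"cigarette": True,  "dry": False, "wind": True,  "fire": False},
    {"cigarette": False, "dry": True,  "wind": True,  "fire": False},
    {"cigarette": False, "dry": True,  "wind": False, "fire": False},
    {"cigarette": True,  "dry": True,  "wind": False, "fire": True},
]

def necessary(factor):
    # Necessary: no fire ever occurs in the factor's absence.
    return all(s[factor] for s in scenarios if s["fire"])

def sufficient(factor):
    # Sufficient: fire occurs whenever the factor is present.
    return all(s["fire"] for s in scenarios if s[factor])

for f in ("cigarette", "dry", "wind"):
    print(f, "necessary:", necessary(f), "sufficient:", sufficient(f))
```

In this made-up data the cigarette butt is necessary but not sufficient: every fire involves one, yet plenty of tossed butts start no fire at all.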

The Researcher Passport: Improving Data Access and Confidentiality Protection


Report by Margaret C. Levenstein, Allison R.B. Tyler, and Johanna Davidson Bleckman: “Research and evidence-building benefit from the increased availability of administrative datasets, linkage across datasets, detailed geospatial data, and other confidential data. Systems and policies for provisioning access to confidential data, however, have not kept pace and indeed restrict and unnecessarily encumber leading-edge science.

One series of roadblocks can be smoothed or removed by establishing a common understanding of what constitutes different levels of data sensitivity and risk as well as minimum researcher criteria for data access within these levels. This report presents the results of a recently completed study of 23 data repositories.

It describes the extant landscape of policies, procedures, practices, and norms for restricted data access and identifies the significant challenges faced by researchers interested in accessing and analyzing restricted use datasets.

It identifies commonalities among these repositories to articulate shared community standards that can be the basis of a community-normed researcher passport: a credential that identifies a trusted researcher to multiple repositories and other data custodians.

Three main developments are recommended.

First, language harmonization: establishing a common set of terms and definitions, one that will evolve over time through collaboration within the research community, will allow different repositories to understand and integrate shared standards and technologies into their own processes.

Second: develop a researcher passport, a durable and transferable digital identifier issued by a central, community-recognized data steward. This passport will capture the researcher attributes that emerged as common elements of user access requirements across repositories (e.g., training, academic degrees, institutional affiliation, citizenship status, and country of residence), together with verification of those attributes.

Third: data custodians issue visas that grant a passport holder access to particular datasets for a particular project for a specific period of time. Like stamps on a passport, these visas provide a history of a researcher’s access to restricted data. This history is integrated into the researcher’s credential, establishing the researcher’s reputation as a trusted data steward….(More)”.
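
One way to picture the proposed credential is as a data model: a passport of verified researcher attributes plus custodian-issued, time-bounded visas. The sketch below is purely illustrative (the report does not prescribe an implementation, and all field names are assumptions):

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Visa:
    # A custodian-issued grant: one dataset, one project, bounded in time.
    dataset: str
    project: str
    issued_by: str
    valid_from: date
    valid_until: date

@dataclass
class ResearcherPassport:
    # Community-verified researcher attributes (fields are illustrative).
    researcher_id: str
    institution: str
    training_completed: list[str] = field(default_factory=list)
    visas: list[Visa] = field(default_factory=list)

    def active_visas(self, on: date) -> list[Visa]:
        """Visas valid on a given date; expired ones remain as history."""
        return [v for v in self.visas if v.valid_from <= on <= v.valid_until]

passport = ResearcherPassport("orcid:0000-0000-0000-0000", "Univ. of Example",
                              ["confidential-data-training-2018"])
passport.visas.append(Visa("restricted-survey-v2", "migration-study",
                           "ICPSR", date(2018, 7, 1), date(2019, 6, 30)))
print(passport.active_visas(date(2018, 12, 1)))
```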

Skills for a Lifetime


Nate Silver’s commencement address at Kenyon College: “….Power has shifted toward people and companies with a lot of proficiency in data science.

I obviously don’t think that’s entirely a bad thing. But it’s by no means entirely a good thing, either. You should still inherently harbor some suspicion of big, powerful institutions and their potentially self-serving and short-sighted motivations. Companies and governments that are capable of using data in powerful ways are also capable of abusing it.

What worries me the most, especially at companies like Facebook and at other Silicon Valley behemoths, is the idea that using data science allows one to remove human judgment from the equation. For instance, in announcing a recent change to Facebook’s News Feed algorithm, Mark Zuckerberg claimed that Facebook was not “comfortable” trying to come up with a way to determine which news organizations were most trustworthy; rather, the “most objective” solution was to have readers vote on trustworthiness instead. Maybe this is a good idea and maybe it isn’t — but what bothered me was the notion that Facebook could avoid responsibility for its algorithm by outsourcing the judgment to its readers.

I also worry about this attitude when I hear people use terms such as “artificial intelligence” and “machine learning” (instead of simpler terms like “computer program”). Phrases like “machine learning” appeal to people’s notion of a push-button solution — meaning, push a button, and the computer does all your thinking for you, no human judgment required.

But the reality is that working with data requires lots of judgment. First, it requires critical judgment — and experience — when drawing inferences from data. And second, it requires moral judgment in deciding what your goals are and in establishing boundaries for your work.

Let’s talk about that first type of judgment — critical judgment. The more experience you have in working with different data sets, the more you’ll realize that the correct interpretation of the data is rarely obvious, and that the obvious-seeming interpretation isn’t always correct. Sometimes changing a single assumption or a single line of code can radically change your conclusion. In the 2016 U.S. presidential election, for instance, there were a series of models that all used almost exactly the same inputs — but they ranged from giving Trump as high as roughly a one-in-three chance of winning the presidency (that was FiveThirtyEight’s model) to as low as one chance in 100, depending on fairly subtle aspects of how each algorithm was designed….(More)”.
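
To see how a single assumption can swing a forecast, consider this toy Monte Carlo sketch (an illustration only, not FiveThirtyEight’s or anyone else’s actual model), in which the only thing that varies is the assumed correlation between state-level polling errors:

```python
import numpy as np

rng = np.random.default_rng(0)

def upset_probability(rho, n_sims=100_000):
    """Chance the trailing candidate wins a toy 10-state race, given a
    correlation rho between state-level polling errors."""
    n_states, margin, error_sd = 10, 0.02, 0.03
    # The single assumption being varied: how errors co-move across states.
    cov = error_sd**2 * (rho * np.ones((n_states, n_states))
                         + (1 - rho) * np.eye(n_states))
    errors = rng.multivariate_normal(np.zeros(n_states), cov, size=n_sims)
    # The trailing candidate flips a state when the polling error there
    # exceeds the 2-point margin; winning 6 of 10 states wins the race.
    return ((errors > margin).sum(axis=1) >= 6).mean()

print(upset_probability(rho=0.0))  # independent errors: upsets are rare
print(upset_probability(rho=0.7))  # correlated errors: upsets are much likelier
```

In this toy setup the upset probability rises roughly tenfold once errors are allowed to correlate across states, a design choice often cited in postmortems of the 2016 forecasts.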

Policy experimentation: core concepts, political dynamics, governance and impacts


Article by Dave Huitema, Andrew Jordan, Stefania Munaretto and Mikael Hildén in Policy Sciences: “In the last two decades, many areas of the social sciences have embraced an ‘experimentalist turn’. It is well known, for instance, that experiments are a key ingredient in the emergence of behavioral economics, but they are also increasingly popular in sociology, political science, planning, and architecture (see McDermott 2002). It seems that the potential advantages of experiments are better appreciated today than they were in the past.
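
The core appeal of experiments is that random assignment balances confounders in expectation, so a simple difference in means identifies a causal effect. A minimal sketch on simulated data (illustrative only; all numbers are invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# A toy randomized policy experiment: 2,000 households, half randomly
# assigned a hypothetical "nudge" letter, outcome = program enrollment.
n = 2_000
treated = rng.permutation(np.repeat([0, 1], n // 2))
propensity = rng.normal(0.40, 0.15, n)   # latent propensity to enroll
enrolled = (propensity + 0.03 * treated > 0.5).astype(float)

# Random assignment balances confounders in expectation, so the simple
# difference in means estimates the average treatment effect.
effect = enrolled[treated == 1].mean() - enrolled[treated == 0].mean()
t_stat, p_value = stats.ttest_ind(enrolled[treated == 1], enrolled[treated == 0])
print(f"estimated effect: {effect:+.3f} (p = {p_value:.3f})")
```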

But the turn towards experimentalism is not without its critics. In her passionate plea for more experimentation in political science, for instance, McDermott (2002: 42) observes how many political scientists are hesitant: they are more interested in large-scale multiple regression work, lack training in experimentation, do not see how experiments could fit into a broader research strategy, and alternative movements in political science (such as constructivists and postmodernists) consider that experimental work cannot capture complexities and nuances. Representing some of these criticisms, Howe (2004) suggests that experimentation is being oversold and highlights various complications, especially the trade-offs that exist between internal and external validity, the fact that causal inferences can be generated using many other research methods, and the difficulty of comparing governance interventions to new medications in medicine….(More)”.