The Story Rules Podcast E15: Crafting Data Narratives with Pramit Bhattacharya (Transcript)

This transcript has been created using a combination of AI transcription tools and (some painstaking) human effort. Please excuse any typos, grammatical mistakes, inaccurate time stamps, or other errors. Specifically, the time stamps would not account for the intro portion of the podcast.

You may share portions of the transcript with due credit. Enjoy!

And, if you find this content useful, it would be great if you could leave a review on your preferred podcast platform.

Intro hook:

“No data exists on its own; there has to be a story, or a theory, or a hypothesis which connects that and everything else that we know about the world or the subject matter of your story. When you say that these two states in northern India have developed, is it reflected in the GDP per capita numbers? Is it reflected in the unemployment numbers, that government data itself is collecting? Is it reflected in the other development indicators?”

Welcome to the Story Rules podcast with me, Ravishankar Iyer, where we learn from some of the best storytellers in the world, find their story and unearth the secrets of their craft.

Today we speak with Pramit Bhattacharya, the ex-Data Editor at the Mint newspaper and currently a freelance data journalist based in Chennai.

Data is the core raw material with which we build a story. If the quality of your sand or clay is rubbish, then the bricks, and the house that you build with it, will also be rubbish.

Pramit has spent several years tracking macro data in India – from various government and institutional sources. He deeply understands the storied history of India’s statistical infrastructure as well as some of the recent troubling developments in that space.

Over the years, he has written several detailed pieces arguing for what needs to be done to improve our data foundation.

He has also written data-driven investigative pieces which have shed more light on key sectors of the economy.

In this conversation, Pramit shares some of the techniques that we can all use when working with data:

He talks about the importance of validating your data.
- How do you know what data sources (especially from the government) to trust?
- How can transparency – especially concerning the raw data and collection methodology – engender trust in data?
- And how can you check the credibility of one metric by triangulating data from other related sources?
Pramit also shares his technique to not get swayed by his own hypotheses and biases when investigating a data story.
Finally, he talks about the importance of the counter-factual – a key technique to ensure that we don’t get too swayed by alarmist headlines. By asking ourselves – Ok, X looks bad, but what is the counter-factual? What is the norm for a similar context? – we can be better placed to come to an informed judgement about X.

It’s a conversation filled with practical nuggets of wisdom that you can use to improve your own data stories.

Unfortunately we had to cut this conversation a bit short because of an unforeseen commitment that he had. I definitely hope to continue my conversation with Pramit sometime in the future!

With that, let’s dive in.

Ravi (3:00)
Hi, Pramit. Welcome to the Story Rules podcast!

Pramit (3:04)
Thanks for having me, Ravi.

Ravi (3:06)
Wonderful. Pramit, you’ve grown up in Assam, you graduated from the famed Cotton College there; you have gone ahead and done your post-grad from the Indira Gandhi Institute of Development Research. The thing that I find quite remarkable about you, Pramit, is that I’ve heard the very detailed, lovely, deep conversation that you’ve had with Amit Varma – which I’ll link in the show notes here – and the thing that I find remarkable is the fairly high sense of self-awareness (you have) including what you want to do, what you don’t want to do; what kind of work you don’t want to do, at a fairly young age. It might seem normal to you, but coming from my world, (though I had) a lot of friends, it wasn’t very high, in my case at least. It was a whole lot of following the herd, saying “Ok, people are doing CA. Let’s do CA; people are working after CA, let’s work after CA; people are doing MBA, let’s do MBA.” And that’s how things went. “People are taking consulting jobs, let’s take consulting jobs.”
In your case, I remember one specific instance where you mentioned that you actually tried out working at a large foreign bank, and then you realised, “Ok, this is not for me.” And then you went into writing and journalism. So, (what I’d like to ask is) did you have any source or inspiration, or people from the family or outside, or just the kind of stuff you were reading, that helped you to get this self-awareness about what you want in life earlier?

Pramit (9:53)
That’s really hard to conclusively answer. Although you say that I have high self-awareness, I may not have the precise answers to everything around that. But I feel I had a strong rebellious streak, since childhood. Which has come down over time; I have mellowed now. When I started off, the people around me were preparing for engineering and medical (degrees), as was usual in most middle-class places (families). The place I come from – Tezpur, in Assam – is known as a cultural centre, as well as a sort of educational centre, in Assam. So, who got high ranks or marks in the state (was a topic that) would be discussed within town for several days. My brother, for instance, stood 9th in class 12th and people remembered that for years. There, the usual choices were engineering and medical. Somehow, I felt that I’m not going to do that.

Ravi (11:05)
What do you think was the source of that rebellious streak?

Pramit (11:10)
Maybe an authoritarian father? My father was a maths teacher, and quite strict. That could have provoked it. Other than that, it’s really hard to (pinpoint a cause).

Ravi (11:19)
You’re the younger one, you said?

Pramit (11:20)
I am the younger one, yes.

Ravi (11:22)
Maybe you didn’t get as much pressure as your elder sibling.

Pramit (11:26)
Yeah, definitely. The other thing was that (in people’s families,) everyone’s desire was to get into IIT. But in my family, the desire was to get into ISI. Once my brother cracked that, he went to ISI and studied there for 5 years. Then he took the pressure off me. That’s why my life became what it was. Otherwise, I think I would have spent far more hours studying maths and preparing for that, and so on.

Ravi (12:04)
Fascinating, right? These accidents of when you’re born, and not just where you’re born in the family also have some sort of an impact (on your life).
Let’s talk about your topic of expertise, which is data. A lot of organizations call themselves ‘data-driven’, and there are famous quotes like “In God We Trust. All others must bring data.” Or so on; a lot of organizations say that very easily, but to actually be data-driven is not that easy, because there are so many cognitive biases; our minds are always made to say, “I know this, I just need to find some data to back it up.” And I remember you sharing a lovely story of when you realized the power of data. You may or may not have realized it earlier, but you talk about the story of the GM cotton seeds. Maybe you can share that story, and talk about the importance of data for policies, of course, and for decision making in the organizational sector, overall.

Pramit (13:50)
I always believed in the power of statistics. In fact, it was one of my favourite subjects throughout, and I always believed that you could tell a better story even with very basic statistics, rather than without. Even in the market report, I used that. But I was also wary of how people abused statistics. Especially in my graduation years at IGIDR, I saw how you could manipulate it because we had this very heavy econometrics-centric course, where you ran regressions and so on, and after a point I, myself, learned how to perform simulations and see. It was an endless process and you can torture data to reveal (a lot of information).
I was a bit wary and didn’t arrive at that balance of knowing how to do it properly, ethically, so on, until much later. The GM Cotton story you mentioned was (related to) one of those factors, there are others as well. I started out with this presumption that GM cotton is responsible for farmer suicides, because that was the dominant narrative that people on the ground like some activists, some farmer-leaders, from Vidarbha in Maharashtra, were saying. My first visit was to Vidarbha.
First of all, the ground reality when I spoke to a wider cross-section, (showed that) just going from a sample point of N = 5, which was my phone conversations; to N = 35 or 40, after the one or three week long – I don’t remember the exact duration that I was there for – that itself gave me many more angles and insights, and suggested the story was much more nuanced. And that the seed is just one part of the entire farm ecosystem; there were many other factors that came into play.

Pramit (17:46)
When you travel and meet larger sets of people at different levels, you begin to hear more nuances in the story; certain contradictory stuff. You also begin to contextualize what you had heard earlier, in a more meaningful manner. I began to understand that the seed part of the entire farm ecosystem story was just one small part of it. There were many other factors, ranging from water, to credit, to other institutional structures such as agricultural extension programs, and so on; which was getting lost in this over-arching single point narrative around Bt cotton and the seeds. There were, in fact, contradictory voices about the seeds from farmers themselves, saying that it had benefited them, at least in the early years, and yields had risen before plateauing at the time I visited.
Then, I came back and looked at the state-wise data, and I saw that in Maharashtra, cotton yields had gone up, post-introduction of Bt, but not dramatically. But in Gujarat, they had indeed gone up dramatically. So, my next visit was to Gujarat; the Saurashtra region, which is the cotton producing region, mainstay of farming. (inaudible)

Ravi (19:12)
This data, was it fairly easy to get? Data on yield of cotton, pre and post the introduction of Bt cotton? Because that’s a crucial step that you took.

Pramit (19:24)
Yeah, that data is a public data set. I don’t remember the exact site, but I can tell you later.

Ravi (19:34)
No, the point is not about the site, but what I’m trying to say is that if somebody had to test this hypothesis, all they had to do is do what you did, right? How much time would it have taken for you to get those two pieces of data for Maharashtra and Gujarat?

Pramit (19:48)
Not much, frankly. It was fairly easy. I had already done some stories on agriculture, so I was familiar with the data.

Ravi (19:54)
It’s interesting sometimes (how) to refute or cross-check a narrative, it’s not too difficult to get some basic data.

Pramit (20:02)
Or you speak to the agricultural economists, saying, “Is this data available? Where can I find it?” and you know which year Bt was introduced, so that is public knowledge.

Ravi (20:11)
Fascinating. So, you went to Gujarat.

Pramit (20:15)
Yeah. I spoke to people there and I came across a completely different story. I began reading up on Gujarat so I was able to locate local seed producers, who had played a role in introducing a version of Bt cotton which was similar to Monsanto’s, but they had not paid any copyright, any patent fees or licensing, so it was some kind of quasi-illegal seed, but it was very popular. Even at the time when I was visiting, one could get those packets; I saw those packets in the hands of farmers. The other advantage was you didn’t have to pay in advance, unlike in Vidarbha where you had to pay upfront for the Bt seeds. Here, you paid depending on whether the crop performed or not, because the farmer would say, “Hey, these are not guaranteed.” So, if the crop fails, you don’t have to pay. If it works, you have to pay a certain amount which is much less than the others. So, a pricing power difference was there.
Beyond all that, the availability of water – both because of Narmada water coming in, and the local water harvesting efforts with various structures created over the past decade, not just because of governmental efforts but also because of non-governmental interventions; and finally, the system of extension workers was also superior, they had more people there to help in the farms. And also, credit: it was much more easily available. Gujarat has a long history of co-operative trade cultures, which is also there in parts of Maharashtra, but not in Vidarbha. You can find it in western Maharashtra and some parts of Konkan, but not much in Vidarbha. In Gujarat, there is a fairly widespread and far superior trade network, throughout the state.
In fact, one of the things that people don’t talk about (related to) the Gujarat model, which I did, for my writing, was that it was more inclusive because of the farm story, and cotton was also a big part of it. It was not industrial growth. Pre and post Modi, you won’t find a structural break in the industrial growth, but you will find a structural break in agricultural growth. You will find some shift, and it is not just because of one person, obviously – there is a host of institutional factors, as I described, including water, history of credit culture, so on.
The final story I wrote was much more nuanced. I started with this anti-Monsanto bias, but by the time I wrote it, I ended up being biased against all the NGOs that were spouting all these narratives about Monsanto. And as I said in the other podcast, which I should clarify here as well, this does not mean that all NGOs I’ve encountered in my reporting are like that. There are many NGOs doing good work, and many NGOs have helped me uncover the truth. But, at the same time, there are pockets; and this was one pocket where they had created a huge amount of misinformation around Bt cotton and genetically modified crops. So much so, that even the parliamentary standing committee on genetically modified crops leaned in their favor. Later, I realized that this was an international trend. I spoke to political scientists who studied these NGOs and their workings, and many of them were linked to the organic lobby, who wanted to promote their own products and denounce GM crops as somehow very harmful for your health so people wouldn’t support them and in turn, the demand for their alternatives went up. It was fascinating, but it was also a very challenging process to sift through all these claims and counter-claims, because the internet was full of all kinds of shitty stories around GM crops. In fact, I had sent a 15-point questionnaire to Monsanto, and realized later that 10-12 of those questions were absolutely meaningless. It was about some previous avatar of Monsanto that had some role in the Vietnam war, and I asked them questions about even that. Thinking back, I think they had a lot of patience with me. Initially, we got off on the wrong foot, because they wanted to influence the coverage of my piece. But later on, I came to appreciate their patience in answering a lot of stupid questions also, on my part. I think it was a combination of all that.
And the NGOs initially helped me, but were not so forthcoming with answers to certain critical questions that I had later on. I think that the difference in the level of transparency also shaped my views.

Ravi (25:32)
The fascinating part in this whole thing, Pramit, is that it’s a great lesson or a great case study (stating) to not fall for simple narratives, howsoever appealing they may be, and it’s not easy to look really deeply but some parts of it were not too difficult either. Just ask somebody from the agricultural field, they’ll point you to some data, and that at least arms you with better questions to ask when you’re pushing into this.
I want to go deeper into some of the techniques you have used in this, because they have applications in other elements of storytelling also. But staying on the importance of data, in today’s times of so much misinformation and preconceived notions, camps, etc., do you still believe in the power of changing people’s minds with the right data? Or are you now a little skeptical, like “It doesn’t really matter, Ravi. There’s too many cross narratives going on, and the data doesn’t really help too much.”

Pramit (26:42)
I think I still believe in it. With the caveat that we should not expect revolutionary things. Data is not going to bring about a revolution, and we should be skeptical of anyone who says that anything can bring about a revolution. At the margin – does it make a difference? I think so. There will always be people who have pre-decided what ideology they’re going to support, economics, politics, etc. There’s very little you can do about that. But there is also a very large middle ground of thinking people, smart leaders, smart organizers, who do not have a pre-decided mindset on most things. And it is to them, largely, that you talk. You have to convince and persuade them, but to do that you have to first convince yourself. Data plays a huge role in that, “I have to be convinced of my story first. Then, I can convince the other person.” In both these processes, data plays (a role).
To give a recent example, 2 of the stories that I liked about the entire controversy around schoolgirls wearing Hijabs in Karnataka were both data pieces. One was by Roshan Kishore, in the Hindustan Times, and one was by Tauseef and Pragya, in Mint – Plain Facts. The first story by Roshan showed how patriarchy in various forms exists in all religions, and some form of clothing restrictions are there; Hijab is just one (example of it.)
The second story talked about schooling outcomes, and how there is a similar enrolment rate between Muslim girls and other girls in the early stages, but as they move up higher classes, Muslim girls tend to drop out more. And for many Muslim girls, wearing their Hijab may be a way of negotiating with their families, saying that “I’m not questioning your culture or family traditions, or anything else. Let me also do this. I’ll wear the Hijab and respect everything else that you say, but let me also go to school or college.” It was a combination of data and reporting; they also spoke to people. I think putting the data in that, A – suggested that someone was trying to look at it dispassionately. I’m not saying that all data analysis is necessarily dispassionate or there are no biases, but it helps in some very contested spaces, and this is one of them. You can use data effectively to lower the temperature a bit, and bring out the nuances.
I’m giving examples where I was not involved; I was not involved in any of these stories, I’m speaking just as a reader. I’ve read other stories on Hijab, personal accounts, and I’m not saying those were completely (lacking in nuance), but those stories spoke a lot, at least to a reader like me.

Ravi (30:17)
Totally agree, Pramit. I really like the point that you made about the massive middle. Unfortunately, what happens is that the people at the margins disproportionately own the airwaves, so they make a lot more noise, and you have a sense or feeling that everybody is either here or there. But a lot of people who are there in the middle are silent. They’re the silent majority. It’s always good to remind yourself that it is making a difference; it’s not being seen, but you should continue to work. I really liked that point.
I did read Roshan’s piece. It was so eye-opening, more than anything else. And I, myself, was a classic “middle-person” here; I didn’t have a point of view either which way. Obviously, reading an opinion piece would make me think, “Oh, I think this is right.” “No, I think this is right.”
But looking at this I just realized, “Oh, it’s complicated. There are no easy answers here.” Sometimes, I think data can just do that job very well. That’s a great point.
Now, imagine if I come across data or information from a third party. There is data, and there is data. Quality varies; comprehensiveness varies – whether you’re covering everything, or you’re selectively cherry-picking it. All of those challenges are there. I want to talk about some techniques that you use in validating whether the data is good or not, can we actually use it? Can we rely on it? How credible is it?
You had mentioned a couple of points in an earlier conversation, I want to talk about those and maybe add a few others which I noticed you mentioning in your pieces. One interesting and simple point that you make is: go to the original source. If there is some NITI Aayog report that says “this data is from World Bank”, do not trust that the data is from World Bank. Go to the original World Bank report, make sure they actually have it there, and if that is reporting from somewhere else then go to that original source. So, go to the source. Go to the Gangotri. I think that’s a great, simple point.
The second point you make is – and this was new to me, because I came from a consulting background where we would take anything the Government publishes at face-value. We went, “Ok, Ministry of Health is saying this? Ok, take it. Don’t question that.” And what you made me realize, in your conversation, was that there are administrative sources of data and there are independent sources of data. ICDS is administrative, versus the Statistics Ministry, or those (other organizations that are independent).

Pramit (32:45)
NFHS.

Ravi (32:46)
NFHS is independent, yes. Maybe you could talk about that. Do you have, in your own mind, some sort of hierarchy of credibility for data that comes out of India? Something like “Coming from here, number 1 or number 2”.

Pramit (33:04)
Certainly. What you just said about trusting any data that comes from the government was also standard in journalism till very recently. That was the standard view even at the point when I entered journalism. There were some questions on some data sets, it was not like the 1950s and 60s where everything the PIB gives you would be published uncritically. But as far as statistics was concerned, the gold standard was still seen as anything that was government of India stamped. I think this questioning came primarily from economic or financial journalists, who began to notice there are differences between different data sets which should ideally be pointing to the same direction if the economy is moving in a certain way, but varied. And folks from RBI would sometimes highlight certain discrepancies; folks from equity markets who were tracking it would highlight some; and some of us journalists would report it. I also started out as a market reporter, so it was there that I first began tackling these questions. I remember the first data story I did was on IIP, and how the production of insulated cables was driving the growth up and down, or something like that. I don’t remember. I think it was insulated cables, and then one of my colleagues wrote about antacids. It was the production of antacids that went up 250% or so and IIP went up… I mean in one sense, it was hilarious. But when you spoke to people, they were already saying that IIP is junk, we are not (going to bother with it.) And these were all investors based in Singapore and so on. It is just that these views were not being expressed publicly. Privately, people like institutional investors (were talking about this.)

Ravi (35:09)
They had a hierarchy in their mind.

Pramit (35:11)
Yeah, they were already questioning some of the official data sets. Some of us tried to bring that into our journalism and work, so it was some kind of middle thing (ground). Then, I think the big shift happened with the new GDP series which came around 2015. Unfortunately, a lot of that debate got politicized as if Modi himself came and dictated that this will be the GDP number. I don’t think we have any evidence for that, I at least don’t. But the point is that it was an institutional shift. There was a whole over-arching decay in India’s statistics gathering operations over time, which peaked in this period because we were making a lot of changes for which we were not ready. You need a lot of inputs coming in, which you didn’t have, so it was a garbage-in-garbage-out kind of thing, so a lot of your outputs became faulty because you didn’t have the right inputs. Price deflators, for instance. You didn’t have those, so you put in whatever you had which was wholesale price index, which is what most countries (do).

Ravi (36:19)
Can you explain what is a price deflator for the audience?

Pramit (36:22)
Suppose, today, the shirt I’m wearing costs 1,000 Rupees. Five years down the line, the same shirt, same material, same everything will cost more. Let’s say 1,500. When you look at the production of all shirts, suppose there are 10 shirts being produced. 10 into 1,000 is 10,000 Rupees. 5 years down the line, 10 into 1,500 is 15,000. That jump in the value of total production of this particular shirt of 5,000 Rupees has to just be seen as a price differential. It is not a volume change.

Ravi (37:16)
That does not mean more shirts produced.

Pramit (37:18)
Yes. Economists typically use some kind of inflation index to adjust for this difference. If, instead of 15,000, the total value of shirts is 30,000, then you know that 5,000 of this difference is the price change; the rest 15,000 is the production difference. That is the actual volume which has gone up, and that should go into your GDP and not the 5,000 which is because of price.

Ravi (37:50)
Got it.

Pramit (37:51)
So you separate out the 15,000 and the 5,000. To do that, you need deflators for each item in the current series. Earlier, it was different. A lot of production data was being captured as volumes; you did not have to do this splicing of price and volume.
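
To make the arithmetic in this exchange concrete, here is a minimal sketch in Python using only the shirt numbers from the conversation. The function and variable names are illustrative assumptions, and the simple price ratio used as a deflator stands in for whatever index a statistical agency would actually use; it is not a description of how the GDP series is compiled.

```python
def deflate(nominal_value, price_index):
    """Convert a current-price value into base-year prices."""
    return nominal_value / price_index

# Numbers from the conversation: a shirt costs 1,000 today and 1,500
# five years later; 10 shirts are produced in the base year.
base_price, later_price = 1_000, 1_500
base_quantity = 10
base_value = base_price * base_quantity            # 10,000

# Illustrative deflator: the ratio of the two prices (1.5).
price_index = later_price / base_price

# Case 1: still 10 shirts, so the value rises to 15,000 purely on price.
nominal_same = later_price * base_quantity         # 15,000
real_same = deflate(nominal_same, price_index)     # back to 10,000

# Case 2: the total value is 30,000 instead.
nominal_more = 30_000
quantity_more = nominal_more / later_price         # 20 shirts
# Split of the 20,000 rise as described in the conversation:
price_effect = (later_price - base_price) * base_quantity        # 5,000
volume_effect = (quantity_more - base_quantity) * later_price    # 15,000
# The same output valued at base-year prices:
real_more = deflate(nominal_more, price_index)     # 20,000

print(real_same, price_effect, volume_effect, real_more)
```

In the first case the deflated value is unchanged, which matches the point that a pure price rise is not more shirts. In the second case the 5,000 and 15,000 add up to the full 20,000 rise, and dividing by the price ratio gives the output of 20 shirts valued at the old price.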

Ravi (38:08)
In the new series, instead of volume, they took actual output value.

Pramit (38:14)
Yeah, the (output) value and then deflated it. The explanation of this was also not done very well. Even for someone like me, it took a long time to process this and figure out exactly what was happening. Maybe one of the reasons why they didn’t explain it too well was because they would have to explain to people that we don’t have good, high quality price data, and then people would ask “Why don’t you have high quality price data?” and they’d have to say, “We don’t have those kinds of surveys.” Then people would ask, “Why don’t you have those kinds of surveys?”, then they say, “A – we don’t have the budget; B – we don’t have enough research that has gone into this to be able to roll out such series.” You’d get into discussions of a statistical nature, which would also demand much more from the statistical system. So, to pre-empt all that, you give a very unconvincing but administratively convenient explanation. And because people are also unconvinced, they go for other explanations, such as political explanations of why this happened. Unfortunately, the debate moves on to a different level altogether. But what it did in one sense, which I think is not a wholly negative thing, is that it created a lot of skepticism around government data in general. If the core of your statistics, that is the GDP of your country, is inaccurate or not accurate enough, then what about everything else that we have been taking for granted all these years?
A lot of statisticians complain about this skepticism when they speak to me, and some of them also blame me and my colleagues for it. But I think, on balance, this is a good thing. While you need to have faith in the statistical establishments and their functioning, at the same time, you need to ask for accountability; you need to ask for verifiability; you need to ask for transparency, and you need them to be able to explain whatever administrative choices they make. That people are asking more and more of these questions is a healthy development.
In the extreme case, yes, there is a tail in the entire distribution which will question everything that the governments put out. But not everyone is just blindly dismissing it. And I’m sure that proportion will grow as more and more people learn statistics; more and more people use data in their lives in the world; as more and more people become aware of the role of uncertainty in decision making, in statistical estimates, in much of their lives. They will also realize that many of the estimates that are coming from the government will also be subject to the same uncertainty. That, I think, (enables us to) have much more mature discussions about errors; forecast errors; about how to improve the quality of the data that we are getting and using, and that is the way to go.

Ravi (41:45)
Which is a good thing. It’s a good way to look at this challenging period in the statistical architecture. Going back to the (scenario) that if I’m a user from a large corporation, or from an economist think tank, etc.; let’s say you’ve got independent data producers, which is the National Statistical Organization, or the RBI; then you’ve got the government department; then you’ve got affiliates – would that broadly be a hierarchy or what is it that you would still say (about it), despite whatever challenges are coming even in the NSO?

Pramit (42:23)
Broadly, you’re right. Let me talk about it in the way that we get data or see data. A lot of our emails get flooded with random – no, not random surveys, arbit surveys, done by various organizations, because random is a statistical term and surveys should actually be randomized. But these aren’t randomized, these are surveys of 100 people done at some mall, or done online without any proper methodology. Most of those surveys that you encounter, especially in the private sector, are bullshit surveys. There is no other way to describe it. Within the private sector, I would still say, despite my reservation around CMIE’s Consumer Pyramids survey, that they still do a better job than many other household surveys. There’s CMIE and another organization called PRICE, which does an ICE 360 survey on consumer markets and so on. They have a fairly large sample; they have a certain methodology. I still think both of those surveys have an urban bias. I have not looked at CMIE very closely, but to the extent I have (this is what I think.) I have looked at PRICE very closely, and while using PRICE results, I always caution readers that the rural findings should be taken with a pinch of salt.
Then you move on to the next step of government data, which is usually more trustworthy than any private source. Within that there is, of course, a hierarchy. To figure out the hierarchy, you need to ask who is the data producer? What are the incentives facing that data producer? What are the quality checks institutionalized by that data producer or whoever is in charge? Think of something like a census: it happens every 10 years; the Registrar General of India – there’s a constitutionally mandated system of how it will be done, laid down during the founding moment of the constitution itself. There is a very clear structure out there; after every census they do a post enumeration survey to cross-check their findings. Before every census, there is a months-long training program where they train the enumerators who actually go out into the field. All of this is very publicly published; you get the details, and much of the detailed census data is available online. There is a fairly strong level of transparency, although we could do with more, we could always debate (over that). But it is my opinion that the census has some steps still to be taken to facilitate data transparency. The point is, A – they have a long history of producing credible data, they have a process and everything; and finally, they do not have any particular incentive to tilt the data in a particular direction. The Registrar General of India is not responsible for, say, meeting sanitation targets, or meeting (the targets for) how many schools or hospitals should be built – which is what you evaluate governments on. The data on toilets that the 2022 or 2023 census will throw up will be much more credible than the data from the dashboard that the sanitation ministry is going to throw up, because the sanitation ministry has an incentive to say that we are working really hard. It has nothing to do with this government or that government, it happens in every government.
In fact, when the 2011 census came out, I had done a story on how millions of toilets had gone missing. It was not that millions of toilets had gone missing, it was just that – at the time, it was called the Nirmal Bharat Abhiyan, the predecessor of the Swachh Bharat Abhiyan – the NBA had said that so many toilets had already been constructed by 2011, but when the census results came out, they showed that in many states – particularly Uttar Pradesh – there were millions of (missing) toilets. There was an error; it was negative. Clearly, those were just on paper. I’m quite sure something similar – well, not sure, but I have a hunch that something similar could happen with Swachh Bharat, because the survey data has already told us. For instance, NFHS data are from surveys conducted by unknown ministries which have no incentive to hype up sanitation data. The NFHS survey on sanitation was actually delayed because of objections from the sanitation ministry. They got a preview of the survey, and this is a question we should all raise, whether these ministries should be getting previews or not, because this raises questions on the credibility of the data. Although MOSPI does not have any incentive to paint that data in any way, the very fact that it’s sharing this data in advance raises some questions. Because of that delay, questions were asked around it. The Chief Statistician, in this case, unfortunately, wrote an op-ed together with the scientists, questioning his own numbers. And then former statisticians questioned his narrative. It became unnecessarily political, in my view.
So, when the NFHS numbers came out later, after a gap of a couple of years, they showed that the earlier survey numbers were not truly off the mark. They were both pointing in the (same) direction, and all data sets showed that toilet penetration had increased. It was not that there was nothing, but the dramatic claims of being open defecation free, or (achieving) 100% or 90% coverage, were off the mark. It was much less than what the administrative department responsible for it was claiming. Within the government data, these are the questions you need to ask before you begin to use a certain data set: what is the purpose of that dashboard or the data set? Is it just to support a particular ministry, or just to show that a particular ministry is working? Or is it to provide an overview to a broad range of stakeholders, which includes people in the government also? A lot of people outside the sanitation industry also want to know what is happening. They don’t want to rely on what the sanitation ministry is telling them, either. This corrupted data creates problems even for policy makers, and they admit it. At least in private conversations everyone admits it.

Ravi (49:27)
Fascinating.

Pramit (49:28)
It’s a very open secret kind of thing. There is no hidden (fact) at all. Sometimes when people from the outside (of this circle) say, “Oh no, this is government data (so it must be credible),” I feel like saying, “Look, people within the government are questioning this. And you, sitting outside, are saying let us trust it?” That is ridiculous!

Ravi (49:50)
So fascinating. I think that’s a great way for laypeople: just see what the name of the government agency (posting the data) is. It should not be just “the Government of India”; see what their incentives are, if any. I like the independence part.
You also mentioned another factor in the piece where you covered the NFHS data, which talked about transparency, and how NFHS itself went through initial scepticism when it was launched because of the USAID connection, and how just being transparent and saying, “Guys, here is the data. Go, look at it, and test it yourself,” that’s another angle which can help you to increase credibility.

Pramit (50:31)
Yeah. Someone from NFHS told me that NSS did not release the unit-level data before NFHS came on the scene. Because NFHS started releasing unit-level data, it created discussions within NSS, that “we should probably be doing the same,” and it happened a few years later.

Ravi (50:52)
That’s another great, important factor.
A couple of other things that you have written about, in terms of just testing the validity of a data set: don’t rely on data from just one source, howsoever sexy that source might be. The recent instance that you talk about is the economic survey and the nightlights data, which became a huge talking point in this economic survey about how in North India, especially UP and Bihar, when they compared satellite data on nightlights, they found far more areas of illumination in UP and Bihar in 2011 vs ’19, was it? I forget the two time periods they covered.

Pramit (51:33)
2012 and ’21, if I’m not mistaken.

Ravi (51:35)
’12 and ’21.
There’s a lot of allure (to say) that, “There’s this one source that’s really supporting the point I want to make, so let me run with it,” but no.
So, triangulation – looking at multiple data points, maybe you can talk about how one can do that. Maybe you can talk about this specific instance, and in general, how is it that you can cross-check or verify data from one source with other related but different sources?

Pramit (52:00)
Yes. I’ve written this recently. I’m not questioning the use of new sources of evidence, I myself have used many new sources which people, previously, did not use for those purposes, and it’s absolutely fine. It’s just that you have to be careful in doing it. So even if you want to use satellite images, A – you have to be careful; B – you have to tell the user what the steps you took in being careful are. Don’t assume they’ll trust you on your word.
In this case, the first problem was that there was no transparency. And this was a public document. A team of government officials spent months in preparing the economic survey. It’s ultimately our money, taxpayer money, which is funding this. So, the least they can do is clearly tell us what they have done. They did not do that. There is no legend in that map, if you look at the document; basic details were missing. And then when you get that raw satellite data in pixelated form, it doesn’t come in the form of maps. Someone will have to process it, make certain assumptions; even simple things like cloud cover – different parts of India will have different cloud covers at different points. For instance, the monsoon season in South India functions differently from the rest of the country. Even now in Chennai, it’s the fag end of a second monsoon, which is not there in other parts of the country. So, it’s cloudy today.
So, how do you adjust for those things? You need to make certain assumptions. You do it over a period of time, of course; you take rolling data, and so on. You will have (different data) for the same city on different days, with clouds and without clouds, and then you will do some adjustments. Then there is the question of threshold: in the map, it is shown either as lighted or dark; there is no in-between, it is very hard to show the in-between. But in reality, much of urban and rural India is in-between. If you take too low a threshold, then it is the areas which are least developed that will also come in as very lighted. All of these parameters and adjustments need to be made, of course, but need to be communicated as well. That is very important.
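The threshold point can be illustrated with a minimal Python sketch. The radiance values below are made up for illustration; they are not from the economic survey or from any satellite product, and the function name `lit_share` is hypothetical.

```python
# Hypothetical radiance readings for six pixels (illustrative numbers only,
# not real satellite data): two bright, two in-between, two dim.
radiance = [45.0, 30.0, 6.0, 4.0, 0.8, 0.3]


def lit_share(values, threshold):
    """Return the fraction of pixels classified as 'lit' at a given cutoff."""
    lit = [v for v in values if v >= threshold]
    return len(lit) / len(values)


# The same pixels produce very different maps depending on the cutoff.
for cutoff in (0.5, 5.0, 20.0):
    share = lit_share(radiance, cutoff)
    print(f"threshold={cutoff}: {share:.0%} of pixels shown as lit")
```

With a very low cutoff, even the dim pixels are drawn as lit, which is why the chosen threshold needs to be stated alongside the map.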
No data exists on its own; there has to be a story, or a theory, or a hypothesis which connects that and everything else that we know about the world or the subject matter of your story. When you say that these two states in northern India have developed, is it reflected in the GDP per capita numbers? Is it reflected in the unemployment numbers, that government data itself is collecting? Is it reflected in the other development indicators? For which there are multiple surveys that they could do. So, some kind of comparative analysis (is needed), and that, I think, is useful in any number of fields. When we were students of the Indian economy, a professor told us that if one particular variable or metric is pointing in one direction, and nine others are pointing in another direction, then be very, very careful of that one metric. The idea is to look at a range of metrics that’s as broad as possible, and to some extent, (all surveyors) are already doing that, it’s just that they don’t apply it everywhere. For instance, in the economic survey itself, they showed how they are looking at a dashboard of high-frequency indicators to monitor monthly (growth), and that’s a good thing. They should not be looking at one indicator. RBI also does that, and it mentions, in every bi-monthly policy report, the number of indicators that it’s looking at. At Mint, we used to have this Mint Macro Tracker, where we looked at 16 indicators every month and, month on month, we tracked what was happening across Producer Economies, Consumer Economies, Ease of Living Indicators, External Sector; 4 in each head. The idea was not to rely on just one metric or number. The other thing I often find puzzling about people’s obsession with the GDP number is that at the end of the day it comes from (a) modelling exercise and spits out just one number. Why do you want to focus so obsessively on just that, when you have a much greater wealth of information out there?
I’m not saying that it’s not useful at all. For instance, if you want to look at the fiscal deficit number and ask whether it is something that should worry you, you have to use GDP to normalize it. As a universal denominator, GDP is very useful. What is your company’s expected revenue growth rate over the next 5 years? If it’s 15%, is India’s nominal GDP also expected to grow at 15%? That means your company is not doing too well. Regardless of the fact that you can fool one or two journalists saying that “my company’s going to get 15% (growth)”, in reality your performance (is the same as the economy). If you’re saying 12%, you’re actually underperforming the economy. For those calculations, for those ballpark estimates, GDP is definitely useful. But when it comes to sensing the pulse of the economy, there are other, maybe superior ways of doing it. You can combine high-frequency indicators, which people already do; sophisticated investors already look at multiple variables and do not rely blindly on the GDP at all. Especially not in today’s day and age when so many other indicators are available. We have tried to incorporate that in our work, and whenever teachers invite me to colleges, economics schools, or journalism schools, this is the one thing that I always emphasise. Always look at multiple indicators, that will give you a better sense and will also open up more stories for you. Sometimes, those contradictions are resolvable. If three indicators are pointing in one direction, two in others, then maybe there is some reason as to why this is happening, and you should explore that. But in many cases, you will find that there is one metric pointing in the wrong direction, in that case, it is best to avoid it.
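The rule of thumb about one metric pointing against the rest can be sketched in Python. This is only an illustration under stated assumptions: the indicator names and values are invented, the function `lone_dissenters` is hypothetical, and the 25% cutoff is an arbitrary choice, not something from the Mint Macro Tracker.

```python
# Hypothetical month-on-month changes (in %) for a small basket of indicators.
# Names and numbers are made up for illustration.
changes = {
    "indicator_a": 1.2,
    "indicator_b": 0.8,
    "indicator_c": 2.1,
    "indicator_d": 0.5,
    "indicator_e": -3.0,
}


def lone_dissenters(changes, max_share=0.25):
    """Return indicators whose direction is shared by at most `max_share` of the basket."""
    ups = [name for name, value in changes.items() if value > 0]
    downs = [name for name, value in changes.items() if value < 0]
    if len(ups) == len(downs):
        return []  # evenly split: there is no clear minority
    minority = ups if len(ups) < len(downs) else downs
    if len(minority) / len(changes) <= max_share:
        return sorted(minority)
    return []  # a sizeable minority is a contradiction worth exploring, not an outlier


print(lone_dissenters(changes))
```

A three-versus-two split returns an empty list here on purpose: that kind of disagreement is the resolvable contradiction worth a story, while a lone dissenter is the metric to treat with caution.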

Ravi (58:28)
I remember you mentioning in the other interview that if you have a finding that is extraordinary, it is most likely extraordinarily wrong. I think it’s such a powerful, important point to just look at a basket of indicators and try and make sense of what is happening there.
That’s a nice segue, Pramit, to get into the next point now. You’ve got data, which is hopefully stress-tested and validated to the extent possible, now comes the job of crafting a narrative from that data. There is this age-old challenge, which is that on the one hand you want to go in with a hypothesis; you don’t want to go in completely cold and get lost in the sea of data that you have, but you also want to guard against confirmation bias. That you’ll go in with a narrative, then you’ll cherry-pick and you’ll find the data to fit your narrative. So how do you, personally, try and guard yourself against that? Maybe you can give some examples; you talked about the Monsanto one, where you had a broad narrative and the data made you change your mind. But, maybe (you could mention) a couple of other examples where this has happened with you.

Pramit (59:43)
I think what I try to practice, not saying that I’m always 100% successful all the time; sometimes, biases may creep in, but what I try to do and what I try to encourage in people who are working with me, my team, was to try and be aware of where our biases lie, and to lean against that to the extent possible. To see if the alternative hypothesis, which is questioning our beliefs, stands up to scrutiny or not. Then, after you have debated both, in your own mind, internally, and after speaking to people, looking at the data; doing all your research and ground work; if you still feel that whatever original hypothesis you had still stands, then go ahead and write it. Otherwise, you either drop the story, or write a counter story. In cases where you have a strong original hypothesis, in many cases you don’t have one, or you yourself have two or three different competing explanations of a certain phenomenon and you explore all of those, then you go with whichever comes out after you’ve considered the evidence. This is something you have to practice each time. You get better, but you can still have blind spots. I don’t think there is a limit (to when you’ll stop having biases.) I have (not) reached the limit, I’m still evolving and still trying to find better ways of dealing with this.

Ravi (1:01:12)
This is a great practice, Pramit, to actually look for contradictory evidence to what you are leaning toward. When you have (a hypothesis), do you keep it in your mind or do you write it down, saying “This is what I actually think, and so let me look for stuff to contradict it”?

Pramit (1:01:51)
When I was a Data Editor at Mint, if a journalist proposed (a hypothesis) I would force them to write it down. But when it comes to my own writing, it is sometimes just in my mind, but I’m aware of it. In some cases, I do write it down, but in other cases, it is there (in my mind). I am very, very conscious of this particular aspect.

Ravi (1:02:19)
I think it’s a great practice, Pramit, to just write it down. It makes you aware, I think writing it down is the step to awareness. That’s a great point.
Once you have the sources and the data is coming in, one of the important elements that you talk about is ‘don’t just look at the number without looking at the context of that number’. This was in the piece that you wrote about how to make sense of the budget numbers; you give away all the tricks that Finance Ministers and their ilk use to dress up a certain number. One of the things you said was “Don’t go by the speech. In the speech, they will talk up a certain sector; go by the numbers – have they actually given the budget to that sector?” and also, “Don’t just go by a percentage or an absolute amount.”
I’m going to quote a little bit from your piece, where you say that “One of the usual budget tricks is to state big hikes in percentage terms. Say, a 25% hike in the rural development ministry outlay; and to state small hikes in absolute terms, such as an increase of 5,000 crore in the allocation of health and family welfare. Maybe the 5,000 increase would reflect a 7% increase in nominal terms over the next year. And if inflation is going to be 6%, then the real inflation-adjusted increase is just 1%. And yet, that 5,000 crore increase may get parliamentarians thumping their desk in approval.”
That was a good way that you talked about it. You also talked about looking at how much the overall government spending goes up by. If it’s going up overall by 10%, then if you’re lower than 10%, you are being deprioritized. In storytelling, I call this the use of norms – Don’t go by the number on its own, look for the right norm for that number. Maybe you can talk a little bit more about how you go about saying, “Okay, I’ve got a number. How do I look for the right norms to contextualize it?”
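The arithmetic in the quoted example can be worked through in a short Python sketch. The 5,000 crore, 7% and 6% figures come from the quote and the 10% overall spending growth from the remark above; the function name `hike_in_context` is hypothetical. Note that the quote subtracts inflation from the nominal rate (7% minus 6%), while the sketch divides the two growth factors, which gives a slightly smaller real increase.

```python
def hike_in_context(increase, nominal_pct, inflation_pct, total_spending_growth_pct):
    """Put an absolute budget hike in context.

    Returns the implied previous allocation, the inflation-adjusted (real)
    percentage increase, and the gap versus overall spending growth.
    """
    previous = increase / (nominal_pct / 100)
    real_pct = ((1 + nominal_pct / 100) / (1 + inflation_pct / 100) - 1) * 100
    gap_vs_total = nominal_pct - total_spending_growth_pct
    return previous, real_pct, gap_vs_total


previous, real_pct, gap = hike_in_context(
    increase=5000,                 # crore, from the quoted example
    nominal_pct=7,                 # from the quoted example
    inflation_pct=6,               # from the quoted example
    total_spending_growth_pct=10,  # the norm mentioned in the conversation
)
print(f"Implied previous allocation: {previous:,.0f} crore")
print(f"Real increase: {real_pct:.2f}%")
print(f"Gap versus overall spending growth: {gap} percentage points")
```

A negative gap means the allocation is growing more slowly than total spending, which is the sense in which the sector is being deprioritized.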

Pramit (1:04:26)
In some cases, it is very obvious, or there is already a historically established norm. For fiscal deficit, you look at GDP; in other cases, it’s a process of iteration. It also comes from mistakes. Earlier, I may have used a certain norm, but then realized that this is not the correct way to look at it. For instance, while looking at state economies, I would look at the growth, how it changed over a certain period, and so on, till a point where I realize that the growth was very volatile. Much more volatile than the growth of the country, because individual states can have high volatility. When you’re doing that comparison, it may be better to look at it as a share of the National Product. If you look at what is Gujarat’s share in the total national pile, versus what is Bihar’s, you will see there has not been much change since the 1980s. So, the regional skew in our pattern of development that was there 30 or 40 years ago, is still there. Which you get from various other indicators and development indicators as well. So, there is no growth versus development there; there is no growth in Bihar and there is no development in Bihar. There is more growth in Gujarat and there is more development in Gujarat. Of course, there are different states (with different data); Maharashtra may have higher Per Capita Income than other states, but some of its (human) development indicators may not be as good as some state like Tamil Nadu. The Per Capita Incomes may be more comparable than the development indicators. I’m not saying that everything is a one-on-one crossover, but I’m just pointing out that when you use the right metric to even look at Per Capita Income, or GDP, or GSDP – Gross State Domestic Product, then you get a more accurate picture of where that state stands vis-à-vis the rest of the country.
There was a paper by one of my professors, Nagaraj, in EPW, where he looked at the Gujarat and Bihar model and used that method. I had already written pieces on the same issue, where I used CAGR. While his piece was not exactly contradictory, it came out with different conclusions than mine. One reason, I realized, was the use of the metric. I was looking at short periods; I was looking at CAGR for four or five years, and there was some volatility in the (data). The use of the terminal year and starting year when using CAGR can skew your numbers. So, if you take some other year as your starting point, you can arrive at a different story. That probability is much less when you do this normalization with the National Income kind of thing. There may be some minor blips, but not that kind of huge volatile (difference). Series also become more stable, and it allows for much more meaningful conclusions.
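A small Python sketch can show why the choice of starting year matters for CAGR, and why a share of the national total is steadier. Both series below are invented for illustration; they are not real GSDP or GDP figures and do not reproduce either analysis mentioned above.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


# Hypothetical output series (illustrative units, not real data).
state = {2010: 100.0, 2011: 92.0, 2012: 110.0, 2013: 118.0, 2014: 121.0, 2015: 135.0}
nation = {2010: 2000.0, 2011: 2100.0, 2012: 2250.0, 2013: 2400.0, 2014: 2550.0, 2015: 2700.0}

# Same end year, two different starting years: the dip in 2011 inflates the second figure.
cagr_from_2010 = cagr(state[2010], state[2015], 5)
cagr_from_2011 = cagr(state[2011], state[2015], 4)

# The state's share of the national total for each year.
shares = {year: state[year] / nation[year] for year in state}

print(f"CAGR 2010-2015: {cagr_from_2010:.1%}")
print(f"CAGR 2011-2015: {cagr_from_2011:.1%}")
for year, share in shares.items():
    print(f"{year}: share of national total = {share:.2%}")
```

In this made-up series, starting from the bad year makes the state look like it is booming, while its share of the national total is the same in 2010 and 2015.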
Many of these things also happen in an iterative form, in some cases by reading others’ (work), or looking (at others’ work), in some cases by speaking to people. Sometimes, you start with something, you speak to someone and they say, “Why are you looking at this? It doesn’t make sense. Look at this; this is the right metric to compare with,” then you do that. Or, you include a category. Say, you put the low Per Capita Income states in a bucket. You look at middle income, and so on. It comes from various sources, but you have to be open to learning from different sources and improving and fine-tuning constantly. And then, certain things work for you and a lot of people will tell you, “Okay, this makes sense.” And you get that feedback.

Ravi (1:08:17)
The analogy that comes to mind when you talk about this, is – I think Yuval Harari talks about this, he talks about a bird’s eye view and a satellite’s eye view, that sometimes when you’re talking about one state, you’re only looking at that state’s growth over the last 15 years. But then you’re at a certain level looking at only that state. By going immediately to the national level, you are expanding your perspective and looking at a larger data set, and that helps you to put that performance in context of the national (one). It might be increase of growth here, but overall, it’s not really made a difference. That’s super useful. To build on that, if you are looking at anything on a state or city level, you’ve got the country as the overall average. But when you’re looking at things at a country level, that’s when certain things become tricky. Especially a country like India, which is – I don’t know how to pronounce this term – sui generis; it’s almost like a unique case in the world.
There, one technique that you had used in another conversation comes to mind, which is the technique of a counter-factual. Essentially, the context to this was that you were talking about India versus China, which is a very classic one, and about the northeast and the various challenges that have happened there – the independence movements, all the violence, etc., and one interesting counter-factual that you had talked about was that a lot of people did want independence for the northeast post 1947, but what you say there is that it’s possibly never going to happen, because if it was not India, it would have been under China. And that’s a great, useful counter-factual to have, and you end that part of the conversation by saying, “If that would have been the case, I wouldn’t have been having this conversation, because you can’t have this conversation in China.”
I really loved that, because when we have a certain situation we think, “Oh, this is bad! This is terrible!” but we never ask ourselves, “what is the counter-factual here?” “If this is bad, couldn’t things have been worse?”
Especially when you look at a country like India, where you know this is bad, we don’t have literacy, we don’t have this, we don’t have that, do you sometimes consider saying, “What are the counter-factuals?” What are some right comparisons for India as a whole to say that we are actually not doing too badly here, given our situation, given that we just got independence in the 1940s; and these are the areas where we are actually struggling, even though we had so much time. Do you think about that?

Pramit (1:11:07)
Yeah, all the time. In fact, in recent years, I have come to the view that India’s developmental journey since it won independence has not been that bad. I did not have this view, maybe even five years ago. But now when I look at it, the range of things we have right from the freedom to write, freedom to vote, freedom to speak our mind, to the choices that we have when we go out to the market; the sheer range of things that we can do, the things we can take part in, the kind of activism we can do, the kind of jobs we can do, the many languages that we are exposed to and can pick up if we have an inclination for (it); I think we are too critical – even I am too critical of many things, sometimes, and it reflects in my writing – and we underestimate (the country) because we don’t use the right counter-factual. You are absolutely right.
And yes, the independence part is a classic example. No one wanted independence for the ‘northeast’ as such, the Nagas wanted an independent homeland of their own; the Assamese wanted an independent homeland of their own; so on. But the original insurgency was in Nagaland, right at the founding of the country itself. And for a long time, it was funded by China. There were training camps also, and PLA assistance to the Naga insurgency. It is fairly widely known, it’s not a state secret or anything. It is quite likely that had Nagaland been able to secure independence, initially they would have got into some kind of very close partnership with China, and at some point, China could have taken over not just Nagaland, but the entirety of Arunachal Pradesh, which it anyway claims. Then, how far is it from Assam? That risk, at a certain point of time, was very real. Even now, it is very difficult for someone from the northeast to entirely dismiss the China threat. As I said, the Chinese army marched on till my hometown of Tezpur; people threw banknotes from the SBI locker into a pond. Even now, in my hometown, people go and dive into that pond to see if they can retrieve coins. I haven’t known a single case where anyone actually found anything, but that doesn’t stop young people from that adventure; diving into Padam Pukhuri, as it is called, and recovering something.

Ravi (1:13:47)
At the time they were worried that the Chinese would get a hold of all the coins?

Pramit (1:13:51)
That the Chinese would get a hold of all the coins and whatever items, so that was a real story.
To give an economic counter-factual: there is this talk about how the tax to GDP ratio in India is low, people don’t pay taxes, this, that. But if you look at it carefully, given India’s Per Capita Income, where we stand today, it is not too low. We had done a story on this, Tadit Kundu and I – readers can find it on the net – and we had looked at this counter-factual, that at India’s Per Capita Income, what was American tax compliance? What was English tax compliance? In fact, American tax compliance and Indian tax compliance, as far as I can remember, were almost identical – and this is an old story, from 2018. This complaint, in a previous economic survey, had been highlighted by Arvind Subramanian. And we had a lot of tax adventurism post that, because it gives a justification to the government to harass taxpayers to say that people are not paying taxes. I am sure there is tax evasion; the question is – is tax evasion in India unnaturally higher than any other country? Are Indian businessmen unnaturally more corrupt, or much more corrupt than businessmen in other countries? The answer is most probably – no. We tend to beat everyone in our country, be it the politicians or businessmen, with an unnecessarily harsh stick. And maybe even journalists. (Saying) that we have the worst of the lot, we have the worst of leaders, the worst of businessmen, the worst of gender (equality?); if all of these statements were true, then a diverse country like India could not have survived as a democracy for so many years. We did get certain things right; on certain things we may certainly be far worse than others; in certain things, we may have regressed. I, for instance, believe that as far as the quality of statistics is concerned, we have regressed. At one point we were the leaders of the world, right now we are not; we are lagging.
So, there are certain indicators in which we have regressed, but you have to be very careful about making the right comparisons before you arrive at a conclusion, and not conclude something just because you feel it in the heat of the moment.

Ravi (1:16:13)
Pramit, I wish we could go on for a much, much longer time. And this has been so fascinating, but I am going to take a pause here because I know that you have a commitment. I would love to continue this conversation. There is so much more that I think that I can learn from you, that we all can learn from you.

Pramit (1:16:28)
Definitely, I would love to come again.

Ravi (1:16:30)
Thank you so much, for coming onto this podcast and sharing all these insights with us. I look forward to continuing this sometime later.

Pramit (1:16:37)
Thanks, Ravi. Thanks so much, and thanks for listening.

And that was Pramit Bhattacharya, leading data journalist and an expert on India’s statistical architecture.

A few things which stood out for me in the conversation:

The need to be smart about trusting the source of data
The need to write down your hypotheses/biases so that you are aware of them when you begin the research
The importance of thinking in counter-factuals

If you find this content valuable, please rate and review this podcast on iTunes, Spotify, Google Podcasts, or wherever you listen to them. It’ll help others like you discover these insights!

This podcast was hosted by me, Ravishankar Iyer. Audio editing by Kartik Rajan. Transcript editing by Amisha Jha and all-round support by Sanket Aalegaonkar.

Until next time, may the force of good stories be with you.
