Data Science Archives - Nightingale

Analytics Products Will Never Be Truly Human-Centered Until the Workplaces Behind Them Are
https://nightingaledvs.com/analytics-products-never-human-until-workplaces-are/
Wed, 17 Dec 2025

I’ve been really excited to see a shift in analytics and business intelligence toward more integration of human-centred design, ethics, and accessibility. I learn something new almost every day. However, I feel something is still missing from these conversations: whether these principles are being considered beyond the interface, in our workplaces too.

From what I’ve experienced and witnessed working in analytics, I don’t see the same strides in how analytics work gets done. For example, how many of us have kept producing while our lives were going through upheaval? How many have wondered if we can stay in our jobs, or even our careers, because the way we’re expected to work is unsustainable for our well-being and personal lives? What might happen if we approach our work in a way that decenters speed, volume, and heroics, and recenters all humans involved?

My early days

I discovered data visualization in undergrad while studying cases like the Three Mile Island nuclear accident, where poor information design contributed to near or actual harm. It was one of the first moments in engineering where my ears perked up, especially around how data visualization bridges the analytical, creative, and human.

My early roles in quality improvement in hospitals only deepened that passion. I was fortunate to work alongside clinicians, designers, and researchers who introduced me to co-design methods, the importance of evaluation, and reframed users as collaborators.

Eventually, I landed my first role on an analytics team, supporting BI design and development. However, it was during a time when my mom was battling appendix cancer, and I was living at home to help with caregiving. And my passion for this work quickly collided with the realities of how analytics gets done.

Deadlines versus trauma

When my mom was admitted to palliative care a year later, it happened to line up closely with a due date for a “high-stakes” report I was responsible for developing in Tableau, which I was learning how to use on my own. Because of the project’s size and weight, and the responsibility I felt to deliver, I would work a full day, bring my laptop to hospice care, and continue working near her bedside.

I could have asked for an extension or support. However, analytics routinely feels like a pressure cooker, especially on “high-stakes” projects. Plus, my qualifications were openly being questioned by others, I was identified as one of the “single points of failure”, and was also cautioned about the potential for blame if anything went wrong. Stepping away didn’t truly feel like an option – it was easy to feel cornered. On top of that, I was in my twenties, with undiagnosed neurodiversity, zero concept of needs and boundaries, and overwhelmed, confused, and exhausted.

At my mom’s funeral, a colleague asked when I might return to work, and relayed that people were getting anxious about report delivery. 

Her funeral was on a Friday. I went back to work on Monday. I finished developing and testing the report—and from what I remember, everyone received it when expected. 

I’m not sure if it felt like “a win” for me. It made me question, how are analytics workers perceived? And, what did I just do? 

Breaking points

The elements of that experience were not isolated to any individual, team, or organization; they are recurring threads I’ve encountered and witnessed time and time again as my career in analytics has progressed. 

Fast-forward many years later to a more recent contract, again as a BI designer and developer, where layers of challenging, but common, systemic pressures rattled my nervous system. I eventually had a major Autistic shutdown (an involuntary neurological response to sensory overload), and needed to leave.

I’ve listed some of the challenges below – do any of these resonate, neurodiverse or not?

Structural

  • Unclear or missing roles, scoping, processes, and standards
  • Unrealistic expectations around task complexity and timelines
  • Unpredictability requiring frequent context switching and quick adaptation to change

Cultural/interpersonal

  • Persistent state of urgency, with hustle and “just get it done” culture
  • Lack of autonomy and space, with ongoing progress checks and pressure points
  • Repeatedly having to overexplain, raise concerns, and justify boundaries 
  • Interdepartmental conflict and tension
  • Feeling held responsible for the success of the project

Environmental

For this experience, I was able to be fully remote. From research and my own previous jobs, I know several factors that can be challenging with in-office environments for Autistic workers. These can include adherence to a 9 – 5 schedule, open concept office spaces with bright lighting and noise, and pressure to attend social functions. 

When layers like these start to compound, my nervous system gets flooded with input and demands, and can’t catch up. I get stuck in survival mode, and eventually break or shut down. Autistic burnout can look very different from our typical understanding of burnout, and recovery can require weeks to months (or even years) of deliberate care. Just to note, other Autistic people may have different experiences, supportive conditions, and responses – these are just my own.

Figure 1. Examples of supportive conditions for Autistic employees from a 2023 report by Autism Alliance Canada. It is important to note that Autistic employees and employers can work together to identify the supports that might work best.

At this point, I’m afraid of returning to analytics as it currently exists. It can feel inaccessible to neurodivergence, and unforgiving to responsibilities outside of work. But am I the only one who feels this way? 

Ripple effects: Tired teams, leaders, products, and users

From what I’m seeing across industry research, I don’t think I’m the only one finding this field challenging and unsustainable. Here are some highlights:

Data teams are already overcapacity, despite ever-growing demands

In a 2023 survey of more than 900 data team practitioners and leaders across the United States and the United Kingdom, 84% said their workload exceeded their capacity, and 90% reported that it had increased from the year prior.

The vast majority of data engineering teams feel burnt out

Another survey of over 600 data engineers and managers found that nearly all of them (97%) reported feeling burnt out, primarily due to time spent fixing errors, maintaining data pipelines, and constantly playing catch-up with stakeholder requests. Nearly 90% reported frequent work-life disruptions. 70% said they were likely to leave their current company within a year, and almost 80% were considering leaving the field altogether.

Figure 2. Experiences and impacts of data analytics work on data engineers from a 2021 report by data.world and DataKitchen.

“When a deliverable is met, data engineers are considered heroes. However, “heroism” is a trap. Heroes give up work-life balance. Yesterday’s heroes are quickly forgotten when there is a new deliverable to meet.”

2021 Data Engineering Survey: Burned-out Data Engineers Call for DataOps

Analytics products aren’t sufficiently supporting our end users

In a 2025 survey of more than 200 product leaders, data teams, and executives, 40% said their data doesn’t support decision-making sufficiently, 51% can’t meaningfully interact with the data provided, and 29% export data to spreadsheets daily. 

These are findings I’m not surprised to see, considering how we’re expected to work. From a design perspective, it can be a struggle to carve out time and space to sufficiently understand the data and users before I’m asked to quickly turn around a prototype. Plus, post-launch follow-up and evaluations don’t seem to gain traction before we’re onto the next priority.

We’re hoping AI will save us

In the same survey as above, 75% believe AI-powered analytics might finally help uncover value buried in data. But in a new study by MIT and Snowflake, 77% of data engineering teams are finding their workloads even heavier, despite AI integration. 

While AI has the potential to streamline tasks and improve product quality, a cracked foundation could limit its impact, and cause further complexity and burnout. 

Figure 3. Examples of external and internal pressures in analytics, as well as possible outcomes.

Diverse does not equal inclusive

In analytics, we often point to diversity as evidence that we’re on the right path. When concerns are raised about how pressures, workloads, and expectations may weigh differently across identities, they can be dismissed with the reassurance that our workplaces are “already pretty diverse.”

That might be partially true in terms of representation. A recent study by Statistics Canada showed that 60% of data scientists (one of many roles within analytics) are immigrants, with the majority of first languages being neither English nor French. About one-third of data scientists identify as women+ (defined by the study to include “women and some non-binary people”). 

It is important to recognize that diversity does not always equal inclusion. In other pieces published by Nightingale, Catherine D’Ignazio and Lauren F. Klein, authors of Data Feminism, speak to how racism and sexism are imbued in the end-to-end data lifecycle, reinforced by structures of power, and ultimately surface in our products. An online poll by Christian Osborne showed that 90% of respondents said that they’ve experienced microaggressions at work, which can cause emotional and psychological harm, decrease job satisfaction, and increase turnover. 

We can also be sensitive to trends across all workplaces. In 2024, the Diversity Institute, Future Skills Centre, and Environics Institute for Survey Research published a Canada-wide study on gender, diversity, and discrimination at work. The survey reinforces that workplace discrimination is more likely to be experienced by racialized and Indigenous peoples, women, persons with disabilities, 2SLGBTQ+ individuals, and young adults. It is crucial to recognize that intersectionality amplifies these effects, with racialized and Indigenous people more likely to face multiple forms of discrimination, especially related to gender, age, and disability. And, those who reported experiencing discrimination also reported poorer mental health. 

Even with diversity, we still need to ensure that our analytics workplaces make everyone feel safe, healthy, empowered, and valued. Diversity, equity, and inclusion (DEI) programming remains urgent and necessary, and should not be deprioritized or defunded. Thinking back to the systemic pressures discussed earlier, I wonder how they are felt across different identities. For example, what are the experiences of a woman in a leadership role, a recent immigrant supporting family both at home and overseas, or a new grad with one or more disabilities? Are they really all the same?

What if we worked differently, and prioritized people first?

The tendency for analytics workplaces to be top-down, reactive, chaotic, transactional, and overburdening clearly isn’t working—not for our people, and not for our products. We’ve got more than enough burned out workers and leaders, and more than enough underused products to prove it. And I’m only seeing signs that analytics (and tech more broadly) might be becoming even more unsustainable—from 996 culture, mandatory RTO policies, pressure to upskill for AI, low data readiness for AI, to the defunding of DEI.

I think systemic change (or a reset button) is required to humanize our approach to analytics work. The shift has to include not only analytics teams, but also the ecosystems that rely on us. 

For example, earlier this year, the Canadian Occupational Health and Safety Magazine suggested that workplaces adopt a trauma-informed care (TIC) approach to work. This approach places safety, trust, and empowerment at the center, and recognizes that many of us have experienced trauma—trauma that workplaces can trigger, perpetuate, or even create. Approaches that have become normalized in analytics work, like unpredictability, constant urgency, ambiguity, and the erosion of autonomy, can actually be quite harmful. 

The article references the six pillars of TIC laid out by the Substance Abuse and Mental Health Services Administration (SAMHSA), and cites research showing its positive impacts on employee well-being, satisfaction, retention, operational functionality and effectiveness, and cost efficiency. 

Figure 4. Six key principles of a trauma-informed approach, published by the Substance Abuse and Mental Health Services Administration (SAMHSA).

I have listed the six pillars from SAMHSA below, along with my attempt at (extremely) high-level and brief descriptions tailored to those of us working in analytics. I am still on my own learning journey. 

  1. Safety: Prioritize physical and psychological safety in all elements of the workplace. In analytics, this can mean that people are able to seek clarity, name concerns, and admit uncertainty without fear of punishment or loss of credibility. It can also mean that we respect limits on things like working hours, cognitive load, personal space, and sensory needs.
  2. Trustworthiness and Transparency: Build trust through consistent transparency around decisions, timelines, priorities, and changes. Clarity and predictability can reduce uncertainty, prevent reactivity, and stabilize teams.
  3. Peer Support: Reduce isolation and barriers to connection to foster peer support within and across teams. This can allow for greater understanding across disciplines and parts of the organization, smoother workflows, supportive relationships, shared problem-solving, and better knowledge transfer.
  4. Collaboration and Mutuality: Involve workers in decisions about policies, procedures, tools, standards, and more. Also, when business units and analytics teams better understand each other’s capacities, workflows, complexities, timelines, needs, and so on, collaboration might be smoother, more respectful, and more productive. 
  5. Empowerment, Voice, and Choice: Choice and control are essential for trauma-impacted people. In analytics, empowerment could mean giving workers more agency in defining things like their own scope, workflows, documentation, timelines, training needs, and work arrangements.
  6. Cultural, Historical, and Gender Sensitivity: Address systemic inequities and promote diversity, equity, and inclusion. Design systems from the start to acknowledge, understand, and respect differences. Do not rely on people to constantly identify, overexplain, or advocate for their needs.

Integrating TIC is a deep, long-term commitment that isn’t about checking boxes, a quick workshop, or adding a few supportive practices. It requires honest and sustained cultural and structural assessments, learning, planning, and shifts, and a more balanced distribution of power. But with a new reframing, maybe we can begin to view:

  • Workers as human, collaborators, creators, and both autonomous and interdependent 
  • Leaders as human, coordinators, facilitators, coaches, guides, and anchors
  • Work as collective, learning, growth-oriented, and sustainable 
  • Technology as supportive, enhancing, synchronizing, and shared 

This isn’t meant to be a silver bullet, and I know there are many other challenges in analytics that involve data, tools, processes, and more. It may also seem overly idealistic in our current systems. But I feel like tech is at a precipice, especially in the rush toward AI creation and adoption. We’re already seeing increased exploitation of labour and the environment in the AI space, without consideration of short or long term consequences. If we don’t care to stop and make our systems more sustainable, ethical, equitable, and accessible now—what does this mean for our (very near) future? 

I’m curious about what a different approach to analytics work might bring:

  • Will we have the space to maintain our health, relationships, and lives outside of work?
  • Will relationships within and between teams become more stable, empathetic, and productive—especially between analytics and business units?
  • Will we have more space in between deliverables to recover, reflect, and refine our systems?
  • Will our products become clearer, more cohesive, more aligned, actually used, and have impact?
  • Will we feel safe and supported to show up at work in our own unique ways?

Data Visualization & Affective Computing. Design That Manipulates Emotions or Design That Helps Reflect on Emotions?
https://nightingaledvs.com/data-visualization-affective-computing/
Thu, 06 Nov 2025

What emotions are and how design is connected

Emotions are complex. They are not feelings, nor are they desires. I’ll define emotions as a biopsychological process that happens inside the body and serves as an information-processing tool. I have often heard emotions set in opposition to rationality—by some strange coincidence, pretty often in a sexist logic. But it’s quite the opposite: emotions matter in effective decision-making. The way interactive interfaces, data visualizations, and other design systems that surround us are constructed may influence our emotional experience and processing. The ability to meaningfully experience data visualizations through emotional feedback enhances engagement.

In a design context, how we reflect on our emotional experiences can vary depending on the system architecture. That matters because we are surrounded by interactive systems, from AI-based digital products to newsroom data visualizations and train ticket machines. This is why the framework for constructing design systems that fascinates me is affective computing, a discipline researching how emotions can be detected and responded to by interactive systems. Currently, techniques such as emotion recognition via audio, speech and physiological data as well as sentiment analysis of textual evaluations and opinion mining are used to get this information. But is this factual data enough to effectively and empathetically evaluate the meaning of communication? Boehner and colleagues wrote that there are a lot of caveats to interpreting emotions, such as limitations of the given evaluation method. This means that the methods of factual evaluation should be combined with cultural understanding and nuanced assessment.
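To make the “factual evaluation” side concrete, here is a deliberately crude, purely illustrative Python sketch of lexicon-based sentiment scoring. The word list and weights are invented for the example; as the paragraph above notes, real systems would need multiple channels plus cultural and situational context to make even a guess at emotion.

```python
# Toy illustration only: a tiny lexicon-based sentiment scorer.
# The word lists and weights are invented for this example and are far
# too crude for real affect detection.

LEXICON = {
    "happy": 1.0, "glad": 0.8, "love": 1.0,
    "sad": -1.0, "angry": -0.8, "frustrated": -0.9,
}

def sentiment_score(text: str) -> float:
    """Average the lexicon weights of the words that appear in the text."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    hits = [LEXICON[w] for w in words if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

print(sentiment_score("I am happy but also a bit frustrated"))  # ~0.05
```

Even at this toy scale, the limits of purely factual evaluation are visible: the scorer has no idea whether “happy” is sincere, ironic, or quoted.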

Emotions are constructed not only within physiological, but also cultural and social contexts. Emotion manifestation must make sense within the cultural context in which we live, and it matters what our own reaction to our emotions is. Am I ashamed to be openly angry, or do I rather feel justified? (Although not an emotion, I keep thinking about a North Korean refugee explaining that there is no concept of “depression” in North Korea. When one has depression in North Korea, how is the interaction with this state constructed and articulated?) Dr. Rosalind Picard, the creator of affective computing, said that “Originally, affective computing was an area of research created to give technology the skills of emotional intelligence. The goal is to create technology that shows people respect, such as by not continuing to do things that cause people to become frustrated or annoyed.” She also says, “There is no magic sensor that will accurately convey how someone is feeling. We need to combine AI learning with lots of information from multiple channels gathered over time to even make a guess at feelings.”

Two models

Boehner and colleagues proposed a classification of affective computing, depending on what the goal is: informative and interactive models.

The informative model is based on the idea that emotions can be classified, symbolically encoded, categorized, and transmitted. The success metric for such a system is whether the emotion I sent to another user is interpreted correctly.

The interactive model is based on the idea that emotions are constructed in a process of interaction. The goal of the system in this framework is to provide a place to reflect on emotion. Whether the system helped the user interpret, understand, and reflect on their emotional state is the measure of its success.

Scheme by the author

How models work

I find the comparison between informative and interactive models important today because, in a world of deceptive patterns, misleading charts, and AI that maximizes engagement over safety, it is important to design systems that can not only comprehend emotions, but also give a safe space for reflection and empathy without pushing the limits.

How would an information model be different from an interaction model? Let’s say I have a conversation with my childhood friend. Two scenarios:

  1. We use instant messaging and exchange emoji. Transmission of emotion is limited by the range of animations and character settings (still, it can be fun). My sadness becomes a static crying cat, and my friend’s irony becomes an animated character resembling her, but with orange hair. This is an information model.
  2. We use the email agent EmoteMail (old and dead, but interesting nevertheless). EmoteMail took photos of the user’s face while she was writing an email. Each photo was automatically placed next to the paragraph being written when it was taken. Moreover, paragraphs were color-coded to reflect how long they took to write. (I’m happy this is not my compulsory work email agent, but for a conversation with a friend, a long-distance flirtation, or a quarrel, it could perhaps be an interesting experience.) This is an interaction model.
Emotemail. (Source)

We have two totally different conversation spaces. The system can initiate analysis of the interaction result (information model, emoji) or catalyze interpretations of the interaction (interaction model, EmoteMail). Unlike emoji-type apps, EmoteMail-type apps remove the predetermined classifications, providing more direct access to the emotion of another person and an instrument for interpretation via data collection and representation. Predetermined classifications may limit the overlay of cultural and situational context (have you ever had the desire to react to someone’s Instagram story, but pressing the heart, fire, or clapping hands was deeply contextually inappropriate?). In EmoteMail, the context of two people enhances the meaning of interaction, and the visualization of behavior provides clues to a person’s emotional state, but it doesn’t provide answers. It allows users to draw their own conclusions, providing a framework for data collection as a playground for interaction.

However, the design system might have been pushed too far. In a forum discussing the project, a user named Adam Kazwell wrote: “the thing that scares me about EmoteMail is having the recipient see what didn’t show up in the final draft. What value is added knowing that I misspelled recipient 3 times before I posted this comment? (…) If you want more than just cold-hard text, maybe pick up a phone or meet face-to-face :)” That was in 2004; in post-COVID 2025, full of cold-hard texts and videoconferencing that is not a perfect substitute for face-to-face communication, perhaps it is a good time to build design systems using both interaction and information models.

Not only may we not want to overshare in design systems, as Adam Kazwell mentioned, but sometimes we need time to process and understand the emotion we are experiencing. That’s why I don’t talk to AI about my emotions—I don’t want any priming or forcing. I need to get there myself. And with the latest Congress hearing Examining The Harm Of Chatbots on the tragic deaths of teenagers who interacted with AI agents, the question of the safety of technologies that can mimic empathy as confidants is pressing. Perhaps an interactive model of computing can provide ideas on how to construct safer systems and balance the widely used information model, which may understand emotions but uses them to drive engagement.

An example of an interactive model in data visualization that gives space for reflection on emotions is the Tied Knots project, which tells stories of harassment in academia. It provides users with a space to reflect on emotions and lived or observed experiences, and fosters a sense of community without pushing users to any particular conclusion or emotion. It prompts users to assess the situation and maybe even make some personal decisions. Another example is Affective Diary, a data visualization project that empowered participants to track their emotional experiences via guided questions and sensor-tracking of arousal and movement – something that can be found in health trackers like Oura. However, researchers found that using graphs was not the best way to connect with emotional experiences, and proposed that data visualization for empathy should look “familiar”. 

The way social media platforms mine data about users is related more to the informative model, with the intention of correctly understanding, predicting, and profiting from the user’s emotions. On the positive side, health tracking apps and devices also utilize an informative model by collecting physiological data to assess the physical and emotional state of the user, which, with ethical data collection, can promote well-being and improve health. A good example of an informative model that provides such reflective space without pushing boundaries is the app How We Feel, which helps users understand their emotional state by offering hundreds of emotions to choose from, each with its own classification. At the end of the week, the user receives a data visualization of their emotion distribution, as well as tools to manage those emotions.

How We Feel new app feature. Image provided by the author.

What could be done better

I feel that the informative model, although useful and important, has been overworked in the service of marketing, while the interactive model is a humane framework for human-computer co-existence that deserves a place in our current approach to business and metrics. So many of our daily interactions are with design systems (especially ones scaled to serve many users) that can handle big data, but lack a human touch and compassion.

I love the idea that a design system doesn’t get to know my emotions to analyze me (I don’t like you, Facebook), but instead gives me a space to make sense of my emotions. The question is: how can we sustain such design at scale, systemically?

There are promising examples of integrating empathy and meaningful interaction into the business model. For example, the Deep Viewpoints application developed for the Irish Museum of Modern Art provides visitors with a digital platform to share the emotions they feel when interacting with an artwork. Through mediation, users can also share reflections and questions in the form of a digital script, allowing others to access and utilize it for their own reflective and interpretive experiences. Such an app allows museums to better understand their communities, and gives minoritized communities a participatory space in cultural dialogue. It lets people be active participants when they interact with cultural heritage, an important engagement practice.

Deep Viewpoints. (Source)

Another example of a sensory experience that involves a space for reflection is a study at Blair Drummond Safari and Adventure Park. Researchers built a multi-sensory device that allowed red lemurs and visitors to interact through smell, sound, and video. They found that not only did people stay longer, but the experience also increased their empathy towards the animals and improved their educational outcomes. 

“We allowed people to share in the same experience to try and get people to have a sense of understanding of the other, that we are sniffing together, which helps make animals more relatable and understandable,” said Ilyena Hirskyj-Douglas, director of the Animal-Computer Interaction Lab, who led the project. “As people, our impact on animals and the planet is far reaching, and I hope that this empathy can shape how people think and behave towards animal conservation. Though it is really unknown what the lemur in this case thinks. Some zoo keepers think of animal-zoo visitor interaction as a type of environmental enrichment.”

Perhaps if empathy is treated not as a resource to be extracted for engagement, but as a space to build both connection and business, our design systems and our social systems will both benefit.


I Stopped Using Box Plots: The Aftermath
https://nightingaledvs.com/i-stopped-using-box-plots-the-aftermath/
Tue, 28 Jan 2025

I recently learned that my 2021 article about why I no longer use box plots is now the second-most-read article in Nightingale’s history🤯 (or, at least, since Nightingale moved to its current hosting platform). What do you do when you have a hit on your hands? Milk it, baby, by writing a sequel 😎

When that article came out, I got a lot of comments and replies. Like, a lot a lot. Like, I spent three days responding to them. There were all sorts of comments, of course, but there were definitely common themes. This article summarizes the most common replies that I received, along with how I responded to each, making it very much a sequel to the original article, just with several hundred new coauthors. Well, uncredited coauthors🤷

The majority of the replies that I received expressed some form of agreement, with chart creators thanking me for helping them understand why their box plots flopped with audiences or for making them aware of alternatives like strip plots and distribution heatmaps. You’re welcome!

There were, however, also plenty of thoughtful objections and counterarguments, and I’ll be focusing on those because reading about people agreeing with one another is pleasant and boring.

Alrighty, then. First up is…

“This [example box plot] is useful! I can clearly see [insight, insight, insight, etc.]!”

I wasn’t suggesting that box plots aren’t useful. Obviously, they can show useful insights. I was suggesting that simpler chart types like strip plots and distribution heatmaps can show all the same insights that box plots can, but are easier to understand, less prone to misinterpretation, and don’t hide potentially important information. I wasn’t claiming that box plots are useless, just that, when compared with other distribution chart types, box plots have some significant disadvantages and no identifiable advantages, so it might make sense to use other chart types instead.

To dispute the claim that I was making, then, you’d need to show the same dataset as a box plot, strip plot and distribution heatmap, and then identify specific insights that are clearer in the box plot than in those simpler chart types. Many people did send me box plots, but most didn’t include strip plots or distribution heatmaps of the same data. This made it difficult or impossible to see if the insights that they pointed out in their box plot would have been just as clear in those simpler chart types. None of these responses, then, actually addressed the claim that I was making.

Some people did step up, however, such as Sergio Garcia Mora, who showed the same dataset in a variety of chart types in this fantastic article:

A box plot compares salary distributions for HR roles in Argentina by gender. Male employees generally have higher medians and larger ranges than female employees across roles such as Analyst, HRBP, and Manager. Purple and teal differentiate genders.
A scatter plot displays salary distributions for HR professionals in Argentina by gender and role, with distinct median lines for male and female employees for roles like Analyst, HRBP, Supervisor, Head, and Manager. Purple represents female employees, and teal represents male employees.

This is what Sergio wrote about the box plot version:

“What I like about this visualization is that we can see the distribution of the salaries by the size of the halves of the boxes. Let’s take for instance the Head position. The medians are similar, but in the case of women, the bottom half of the box is larger, so that means that the range of salaries for women is broader. That tells us that there are women in Head position with salaries far below the median.

The opposite happens with male professionals in the Head position. The top half of the box is larger meaning that there are men in the Head position with salaries far above the median.”

To my eye, anyway, all of these insights are at least as clear in the jittered strip plot version. Plus, I could see several insights in the strip plot that weren’t visible in the box plot, such as the fact that there are fewer employees in the more senior roles, that no Managers make between about AR$85K and AR$110K, etc.

There might be box plots out there that show insights that aren’t as clear in simpler chart types, but I have yet to come across a single one. If you have one, send it to me! (Just make sure to include a well-designed strip plot and distribution heatmap showing the same data, s’il vous plait.)

“Box plots are useful because they show quartiles.”

Quartiles aren’t insights, they’re just features of charts that allow readers to spot actual insights like, “The salaries in Company A are more dispersed than the salaries in Company B, which suggests that there’s more room to move up in Company A.” That’s an insight, and you almost never need quartiles to spot those.

Saying that “box plots are a useful way to show quartiles” is like saying that “distribution heatmaps are a useful way to show the bins/intervals that the values fall into.” These aren’t insights, they’re chart features that allow readers to spot insights. What ultimately matters is how clearly each chart type shows insights, not the specific mechanisms that are used to make those insights clear.

Having said that, there are rare cases when quartiles have some special meaning. For example, maybe a company has decided to lay off the middle 50% of its employees based on salaries (which would be weird but, like I said, these are rare cases). Even in a scenario like that, though, interquartile ranges (i.e., the middle 50% of values) could be shown in strip plots and distribution heatmaps, which would still be easier to read and clearer than box plots:

Side-by-side visualizations highlight age distribution by group, with scatter plots overlaid on box plots on the left and a heatmap representing interquartile ranges in yellow on the right.

Like I said, though, it would be very rare to have to do this in practice because, in the vast majority of charts, quartiles (or quintiles, terciles, etc.) have no special meaning and aren’t needed in order to spot useful insights.
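For those rare cases, here is a minimal sketch, using matplotlib and made-up data, of how the interquartile range could be shaded directly behind a jittered strip plot, so the middle 50% is visible without hiding any individual values.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
groups = {"A": rng.normal(40, 8, 120), "B": rng.normal(45, 12, 120)}

fig, ax = plt.subplots()
for i, (name, values) in enumerate(groups.items()):
    q1, q3 = np.percentile(values, [25, 75])
    # Shade this group's interquartile range behind the points.
    ax.fill_between([i - 0.3, i + 0.3], q1, q3, color="gold", alpha=0.4)
    # Jittered strip plot: every individual value stays visible.
    x = i + rng.uniform(-0.2, 0.2, size=values.size)
    ax.scatter(x, values, s=12, alpha=0.6)

ax.set_xticks(range(len(groups)), labels=list(groups))
ax.set_ylabel("Age")
plt.show()
```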

“Box plots make outliers easy to spot.”

That’s true, but outliers are just as easy to spot in simpler chart types. For example, in the “salaries by role” jittered strip plot that I showed earlier, the outliers are pretty obvious—they’re the dots that are far away from the main cluster of dots. You could make outliers in a strip plot even more obvious by highlighting those dots but this seems unnecessary; their location away from the other dots already identifies them as outliers.

Outliers can also be added to distribution heatmaps, similar to how they’re added to box plots:

A heatmap of age distribution by group categorizes individuals into age ranges, highlighting the percentage of total members per group in varying shades of blue. Outliers appear as distinct circles.

“Box plots work well when there are many distributions to show because they look less visually busy.”

Some people sent me box plots with many sets of values, like the one below, arguing that other chart types would be even busier looking:

A box plot compares employee salaries across 12 companies, showing varying medians and ranges, with some distributions skewed and a few outliers evident.

It’s true that strip plots can look quite busy when there are many sets of values in a chart, but distribution heatmaps are well-suited to these situations:

Personally, I find that the graphics in a distribution heatmap actually are less visually busy than boxes and whiskers, but this is probably subjective.
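If you want to experiment with this yourself, here is a rough sketch, with invented salary data and arbitrary bin choices, of how a distribution heatmap for many groups could be assembled with numpy and matplotlib.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Fake salary data for 12 companies; means and spreads vary by company.
companies = [rng.normal(60 + 3 * i, 8 + i, 200) for i in range(12)]

bins = np.linspace(20, 160, 29)
# One histogram per company, normalized so each column sums to 100%.
heat = np.array([np.histogram(c, bins=bins)[0] for c in companies], dtype=float)
heat = 100 * heat / heat.sum(axis=1, keepdims=True)

fig, ax = plt.subplots()
mesh = ax.pcolormesh(np.arange(13), bins, heat.T, cmap="Blues")
fig.colorbar(mesh, ax=ax, label="% of company's employees")
ax.set_xticks(np.arange(12) + 0.5, labels=[f"Co {i+1}" for i in range(12)])
ax.set_ylabel("Salary (thousands)")
plt.show()
```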

“Why not combine box plots and strip plots to get the best of both worlds?”

Some people suggested combining strip plots and box plots, like this:

A vertical box plot shows age distributions for three groups labeled A, B, and C. Each plot includes individual data points, medians, and ranges, with Group A showing the widest spread.

Yes, you could do this, but the question then becomes: which specific insights are the boxes making clear that wouldn’t have been clear in the strip plot on its own—perhaps with the medians added, since they’re often relevant? I can’t see any such insights, so the boxes just add complexity without adding any value, IMHO. Basically, I don’t think this is a “best of both worlds” solution because there’s no “second world” in this case, i.e. insights that box plots would show that wouldn’t already be clear in strip plots.

“Sure, box plots don’t work well with multimodal distributions, but they shouldn’t be used to show data like that in the first place.”

A number of people objected to this graphic from the 2021 article:

Two side-by-side plots contrast age distributions for a test and control group. The box plots suggest similarity in medians and spreads, while the scatter plot reveals distinct clustering patterns within each group.

They objected that this wasn’t a valid use case for a box plot because box plots should only be used with unimodal (“bell-shaped”) distributions, not multimodal (“clumpy”) distributions, such as the “Control group” in the jittered strip plot above.

The problem with this objection is that it assumes that readers can always be certain that no chart creators will ever use box plots to show multimodal distributions. If you see a box plot in the wild, though, how can you be certain that the person who created it didn’t decide to use a box plot even though the data contained multimodal distributions? And what about box plots that are dynamically generated based on live data, and in which the distributions might be unimodal on some days and multimodal on others?

Basically, with box plots, readers are always left wondering if the distributions in the chart are unimodal or not—assuming that they’re even aware of this problem in the first place. Chart types like strip plots and distribution heatmaps, however, show unimodal and multimodal distributions clearly and so avoid this problem altogether.

“Box plots are a better choice for more data-savvy audiences.”

Even for audiences that are extremely statistically literate and very used to reading box plots, I’m not sure what benefit box plots would offer that wouldn’t also be offered by simpler chart types (sounding like a broken record now, I know). I am, however, pretty sure that box plots would hide potentially important information from them (gaps, clusters, etc.).

“We shouldn’t be afraid to use chart types that audiences aren’t familiar with. / We should try to teach audiences to read more advanced chart types.”

Totally agree. Indeed, in my Practical Charts course, I cover chart types that many audiences aren’t familiar with, such as step charts and scatterplots (see this article for a more complete list of “basic” chart types that many audiences aren’t familiar with). I cover these potentially unfamiliar chart types in my course because there are certain types of data and certain types of insights that can’t be communicated using simpler, more familiar chart types and so, sometimes, more complex or unfamiliar chart types are unavoidable, and you might need to teach the audience how to read them.

If you’re going to ask an audience to spend their valuable time and brain cells on learning a new chart type, though, there’d better be an “epiphany payoff,” as data storytelling expert Brent Dykes would call it, to justify that effort. I’ve just never seen any epiphany payoffs from box plots that couldn’t also be obtained with more familiar, less effortful chart types.

“There are no bad chart types. All chart types have situations in which they’re the best choice.”

I hear this all the time but I’m not sure why it would be true. It’s easy to forget that chart types are just human inventions, like printing presses and electric toothbrushes; they aren’t fundamental properties of the Universe, like mathematical principles. In fact, box plots are a relatively recent invention, having only been first proposed in the 1950s.

As with any other type of invention, there’s no rule that says that every type of chart needs to have situations in which it’s the best choice. Indeed, the pantheon of human inventions that were the best solution in exactly zero situations is well populated. I wrote more about this idea here.

Box plot defenders also virtually never mentioned one of the major problems that I described in the 2021 article, which is that box plots don’t make “visual sense.”

For example, have a look at the box plot below:

A horizontal box plot shows data spread from 10 to 90, with the interquartile range spanning from 25 to 75 and a median at 50. Whiskers extend to the minimum and maximum values without outliers.

Even to people who are fairly experienced with box plots, it looks like there’s a large cluster of values in the central part of this range.

If you deeply understand box plots and think about it long and hard enough, however, you’ll realize that this box plot shape actually must mean that there are few values in the central part of this distribution, and this data set would have to look something like the jittered strip plot below (which is showing the same data as the box plot above):

A horizontal scatter plot shows two clusters of data between 10–30 and 70–90, illustrating distinct distributions.

That’s really, really not what the box plot seemed to be showing, though, and there are many other situations in which even experienced box plot readers must “think around” these perceptual paradoxes in order to avoid misreading the chart. Yes, this gets a bit easier with practice, but why use a chart type that forces readers to perform these kinds of cognitive gymnastics when there are readily available alternatives that don’t?
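If you’d like to convince yourself of this, here is a small numpy sketch with made-up data roughly matching the charts above: two well-separated clusters still produce a perfectly ordinary-looking five-number summary, which is all a box plot has to work with.

```python
import numpy as np

rng = np.random.default_rng(2)
# Two clusters with a gap in the middle, roughly like the figure above.
values = np.concatenate([rng.uniform(10, 30, 50), rng.uniform(70, 90, 50)])

q0, q1, q2, q3, q4 = np.percentile(values, [0, 25, 50, 75, 100])
print(f"min={q0:.0f}  Q1={q1:.0f}  median={q2:.0f}  Q3={q3:.0f}  max={q4:.0f}")
# Prints something like: min=10  Q1=20  median=50  Q3=80  max=90
# The five-number summary (and hence the box plot) gives no hint that
# there are no values at all between roughly 30 and 70.
```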

So, did any of these exchanges change my opinion about box plots?

As you can probably guess, I still don’t think that box plots are ever a better choice than alternative chart types. However, that’s now a much more thought-through opinion, because people took the time to challenge it with such thought-provoking arguments, and I’m extremely grateful to everyone who chimed in. I remain open to being proven wrong and welcome additional comments and examples; just be sure to include a strip plot and distribution heatmap of the same data. To reply, comment on the post of this article on LinkedIn or Bluesky, or reach out to me via this contact form.

If you still feel that box plots have their place and you’ll continue to use them, that’s totally kosher. I certainly won’t call out anyone for using them, and all of this is just my opinion, of course. I would, however, still urge you to consider alternative chart types for one more reason that I haven’t mentioned yet…

Unfortunately, I’ve seen plenty of people feel needlessly stupid because they found it so difficult to read box plots, or failed to grasp them entirely. Unless you’re certain that all of your readers already understand box plots, avoiding making people feel dumb for no reason might be the best argument of all to consider alternative chart types instead.

Distortion by Design
https://nightingaledvs.com/distortion-by-design/
Tue, 07 Jan 2025

Distortion noun /dɪˈstɔːr.ʃən/ — A change to the intended or true meaning of something; a change to the original or natural shape of something; a change in or loss of sound quality, due to changes in the shape of the sound wave. (Cambridge Dictionary)

Framing distortion

Distortion entails some kind of re-representation of something by changing its shape. Molding it through deformation. Bending it to create a new perspective. By changing shape, distortion introduces new interpretations, emphasizing certain aspects while downplaying others. Unlike abstraction or filtering, distortion inherently warps structure. It is not simply change; it is, by its very nature, a deliberate structural deformation. This fundamental characteristic sets distortion apart from abstraction and filtering. Abstraction is an act of generalization to gain broader, contextual understanding. For instance, Singular Value Decomposition (SVD) is a method of abstraction; by reducing the dimensionality of the data, the representation is simplified, but the original structure remains intact. It aims to lower the structural fidelity without distorting it. 

In 3D modelling, an analogous action would be to decimate: to reduce the number of polygons of a mesh while minimizing shape changes. Although the shape will change, reshaping is not the intention of the algorithm. In descriptive statistics, an example would be deriving and using means in place of original values; not aiming to distort, only abstract. Filtering omits certain elements and narrows the representation, but does not alter the remaining structural parts. For instance, when filtering rows in a dataset, the remaining items stay unaltered, retaining their original structure and relationships. Unlike distortion, it does not reshape or bend the underlying meaning. It simply excludes parts.

The image features four scatter plots arranged in a 2x2 grid, comparing urban and rural garden yields. The top-left plot, labeled "Raw Data," shows a scatter plot with orange points representing urban gardens and green points representing rural gardens, plotting garden size on the x-axis and annual yield on the y-axis. The top-right plot, labeled "Distorted," uses the same variables but displays randomized data, creating a less coherent pattern. The bottom-left plot, labeled "Abstracted," simplifies the data into two large points representing the average values for urban and rural gardens. The bottom-right plot, labeled "Filtered," removes urban gardens entirely, displaying only the green points for rural gardens.
Figure 1
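As a rough sketch of the abstraction-versus-distortion distinction, the snippet below (numpy, with synthetic data) performs the kind of SVD-based rank reduction described above: the representation gets simpler, but it stays close to the original structure rather than reshaping it.

```python
import numpy as np

rng = np.random.default_rng(3)
# A small "dataset": 100 observations of 5 correlated variables.
latent = rng.normal(size=(100, 2))
data = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(100, 5))

# Abstraction via SVD: keep only the k strongest components.
U, s, Vt = np.linalg.svd(data, full_matrices=False)
k = 2
approx = U[:, :k] * s[:k] @ Vt[:k, :]

# The rank-2 approximation simplifies the representation while staying
# close to the original structure (small relative reconstruction error),
# which is the sense in which abstraction lowers fidelity without reshaping.
print(np.linalg.norm(data - approx) / np.linalg.norm(data))
```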

The primary means of communicating information include visual, auditory, and physical mediums. Optical distortion involves reshaping or bending light waves in some way, as in the case of a microscope; microscopes magnify details by bending light through a lens. Sound distortion implies a change to the shape of the sound wave itself. In music, guitar distortion originates from increasing the volume beyond what the speaker can handle. Since the speaker cone is physically constrained from moving the required distance, the sound wave is compressed at its peaks, producing a distinct aggressive tone. Mediums such as these exist in physical space, and so do their distortions. 

In visualization, space exists purely as a construct, allowing designers to bend, displace and collapse it in ways that would be impossible in reality. For instance, virtual space can be bent by applying modifiers to the representation of the spatial domain. In physical reality, achieving the same effect would require manipulating the fabric of space-time itself, something only achievable by extreme physical phenomena, such as black holes. This gives god-like powers to the designer, and with great power comes great responsibility.

In visualization, distortion primarily manifests in two ways: by bending space, or by reshaping the data. Bending space involves altering the spatial domain, such as mapping input values to a logarithmic scale, effectively reshaping the environment in which the data resides. Reshaping the data, on the other hand, involves transforming the function/value domain, such as applying a logarithmic transformation to the output values, which changes the data itself while leaving the underlying space unaltered. 

This image contains two line charts illustrating the use of logarithmic scaling. The left chart, titled "Bending Space," shows the function f(x) = x plotted on a logarithmic x-axis, creating a curve that emphasizes the distribution of values across the scale. The right chart, titled "Reshaping Data," also plots f(x) = x but applies a logarithmic transformation to the y-axis, altering the shape of the curve. Both charts include orange lines and gridlines that highlight the effects of the scale transformations.
Figure 2
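A minimal matplotlib sketch of the two options in Figure 2, with the function and ranges chosen arbitrarily, might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(1, 1000, 500)
y = x  # f(x) = x, as in Figure 2

fig, (left, right) = plt.subplots(1, 2, figsize=(8, 3))

# "Bending space": the data are untouched; the x-axis itself is remapped
# to a logarithmic scale.
left.plot(x, y)
left.set_xscale("log")
left.set_title("Bending space (log x-axis)")

# "Reshaping data": the axes stay linear, but the output values are
# transformed before plotting.
right.plot(x, np.log10(y))
right.set_title("Reshaping data (plot log10 of f(x))")

plt.tight_layout()
plt.show()
```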

In visualization, bending space is a common method to emphasize areas without altering the underlying data. For instance, fisheye views magnify areas of interest while compressing the surrounding area. In three-dimensional visualization, a special case of distortion occurs when projecting onto two dimensions. One might compare it with dimensionality reduction in SVD, and ask: is projection an act of distortion or abstraction? I think it is both. It abstracts by reducing a three-dimensional scene to a two-dimensional representation. At the same time, projection is designed to simulate light traveling through virtual space. By altering the projection, one can simulate optical distortions, such as the bending of light waves through air, materials, and our eyes. This already poses a distortion dilemma: which projection should one choose? One that emulates how humans perceive the world in perspective, or one that projects space orthogonally? Arguably, perspective projection introduces more distortion mathematically; still, orthogonal projection appears more distorted to the human eye.

The good, the bad, and the ugly

Although changing the true meaning of something might seem counterintuitive when searching for the truth, in some cases, distortion is strictly necessary. Take the microscope, for instance: distortion is what makes the phenomenon visible in the first place. In other cases, distortion might not be strictly necessary but is used to enhance comprehension by emphasizing features or changing perspective. However, distortion is a double-edged sword. While it can enhance comprehension, it can also mislead when applied badly. This raises ethical considerations about the way distortion is used, especially in scientific communication. 

Good use occurs when distortion improves perception without misleading. Bad use can be unintentional, such as poor design choices. Bad use can also be intentional, such as deliberate manipulation to exaggerate certain narratives. Intentional misuse of distortion (the bad) is ethically inexcusable, since it deliberately seeks to mislead. Unintentional misuse (the ugly), while perhaps more ethically excusable, might still lead us astray on our path towards truth. The boundary between bad and ugly is often blurred; one can always claim an action was unintentional, even when it was deliberate. This complicates the ethical landscape. Ultimately, regardless of intent, the harm lies in the result: when distortion misleads, it betrays the trust of the audience and compromises the integrity of all science. Therefore, designers have a responsibility to ensure their distortions are aligned with the pursuit of truth.

In the paper “Graphs with logarithmic axes distort lay judgments” by Ryan et al. [9], the problem of misunderstanding exponential growth is explored in an empirical study. It shows, using examples of Covid-19 data, how the choice of a linear or logarithmic scale on the function axis impacts the perception of growth. They demonstrate that logarithmic scales, while useful in research, often lead to less accurate judgements among lay audiences. Participants viewing logarithmic scales were more likely to underestimate the severity of the pandemic, perceive Covid-19 as less dangerous, and express less support for interventions such as mask-wearing and social distancing. While logarithmic scales are important tools in research, one needs to be careful when presenting them to the public. As Ryan et al. suggest, the communication of such graphs should ideally include clear explanations, or be supplemented with linear representations, to lower the risk of misunderstanding.

A common trick with bar charts is to truncate the length-axis, effectively zooming in on a portion to exaggerate the differences in bar lengths. Whether this should be considered distortion or not is an interesting question: the remaining scale is not altered, so are we not simply filtering out irrelevant portions of the axis? This would be true for a dot plot, since points do not have lengths that can be altered. However, in bar charts, the length of a bar is conceptually significant. It stretches all the way to the start of the axis. By truncating the length-axis, you are effectively compressing the omitted section into an infinitely small area, fundamentally distorting the function domain. 

Two bar charts in the image compare data using different axis scales. The chart on the left, labeled "Full Range," presents bars labeled A to E with values spanning the y-axis from 0 to 100, providing an overview of the full dataset. The chart on the right, labeled "Truncated Axis," narrows the y-axis range to 80–100, magnifying small differences between the values and emphasizing variations that were less noticeable in the full range.
Figure 3
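A minimal matplotlib sketch of the contrast in Figure 3, with values invented for illustration:

```python
import matplotlib.pyplot as plt

labels = ["A", "B", "C", "D", "E"]
values = [91, 88, 93, 90, 95]  # made-up values in a narrow band

fig, (full, trunc) = plt.subplots(1, 2, figsize=(8, 3))

full.bar(labels, values)
full.set_ylim(0, 100)     # full range: differences look modest
full.set_title("Full Range")

trunc.bar(labels, values)
trunc.set_ylim(80, 100)   # truncated axis: the same differences look dramatic
trunc.set_title("Truncated Axis")

plt.tight_layout()
plt.show()
```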

In “How Deceptive are Deceptive Visualizations” by Pandey et al. [7], truncating the length-axis falls under the category of message exaggeration. While the actual data remains unaltered, the manipulation of visual encoding exaggerates the perceived magnitude of differences, misleading the viewer into overestimating the variation between data points. To help visualization creators avoid unintentional distortions, the authors advocate for principles and guidelines that promote best practices. For lay audiences vulnerable to intentional distortions, the authors highlight the importance of education to enhance their ability to recognize problematic and potentially deceptive information.

In visualization research, two-dimensional view distortion techniques have received a lot of attention. These techniques aim to address the challenge of presenting large, complex datasets within constrained areas. By selectively magnifying regions of interest while compressing the rest, view distortion techniques allow users to focus on details without losing contextual information. Inspired by effects from cartography and photography, such as the fisheye lens, these techniques distort space to highlight important information. 

In “A Review and Taxonomy of Distortion-Oriented Presentation Techniques” [2], Leung and Apperley identify good and bad uses of view distortion techniques. On the positive side, view distortion can enhance comprehension by dynamically allocating screen space to areas of interest. For instance, in cartographic applications, distortion can allow users to zoom into densely detailed regions while retaining an overview of the surrounding geography. On the negative side, distortion can mislead users by overemphasizing less important areas, causing them to appear more significant than they are. While Leung and Apperley do not raise ethical concerns explicitly in their paper, they emphasize the importance of clarifying what is distorted and what is not, so that the audience remains unbiased. Ethical concerns around view distortion are especially relevant in cartography. For instance, as discussed in “How to Lie with Maps” by Monmonier [1], map projections like the commonly used Mercator projection distort the relative sizes of countries, making regions near the poles appear much larger than they are in reality. This can strengthen cultural or geopolitical biases, influencing the audience’s perception of importance or power. Because the Mercator distortion is not typically explained explicitly on maps, it risks manipulating entire populations into adopting a skewed worldview.

Figure 4. Four grid visualizations demonstrating distortion techniques: a regular Cartesian grid (top left), a one-dimensional fisheye that distorts horizontal spacing around a focus region (top right), a Cartesian fisheye view that emphasizes the center of the grid (bottom left), and a polar fisheye view that warps the grid into a circular, globe-like shape (bottom right).
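The one-dimensional case can be sketched with the classic graphical fisheye transform; the function below is a simplified illustration (parameter names are my own), not the implementation from any of the cited papers:

```python
import numpy as np

def fisheye_1d(x, focus=0.5, d=3.0):
    """Map positions in [0, 1] so that the region around `focus` is magnified
    and the rest is compressed. `d` controls the distortion strength
    (d = 0 leaves the positions unchanged)."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    left = x < focus

    # Normalize the distance from the focus on each side to [0, 1],
    # apply the fisheye function g(t) = (d + 1) * t / (d * t + 1),
    # then map back to the original coordinate range
    t_left = (focus - x[left]) / focus
    out[left] = focus - focus * (d + 1) * t_left / (d * t_left + 1)

    t_right = (x[~left] - focus) / (1 - focus)
    out[~left] = focus + (1 - focus) * (d + 1) * t_right / (d * t_right + 1)
    return out

# Evenly spaced grid lines spread apart near the focus (magnified)
# and bunch up toward the edges (compressed)
grid = np.linspace(0, 1, 11)
print(np.round(fisheye_1d(grid), 3))
```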

Although less common, spatial distortion in three dimensions has its uses. As described in “Exploring Large Graphs in 3D Hyperbolic Space” by Munzner [4], hyperbolic space allows massive datasets to be visualized in a compressed area. By magnifying space around a focus node and its immediate connections, while compressing distant regions, it lets users examine details without losing sight of the overall graph.

In “Spatial Transfer Functions — A Unified Approach to Specifying Deformation in Volume Modeling and Animation” by Chen et al. [6], spatial transfer functions are used to deform 3D volumetric objects. This enables transformations such as sweeping, stretching, and twisting of spatial fields, facilitating interactive exploration of volumetric datasets. As with any distortion technique, these require careful implementation to avoid confusion. Being very technical, they might demand even greater attention to detail compared to simpler methods. However, ethically, there is probably less likelihood of intentional misuse. These techniques are highly specialized and are not typically used in contexts where they might influence public opinion. Their primary focus is on enabling exploration of complex data for specialists, rather than persuasion or public-facing communication.

A more common way to distort in two- and three-dimensional visualization is through non-linear color mapping. Color mapping is a widely used technique in visualization, where a one-dimensional value space is mapped to a continuous range of colors, letting users see relative differences in the data. However, if the values span several orders of magnitude, a linear mapping might be ineffective at displaying detailed differences. To address this, the value space can be mapped onto a non-linear scale, such as a logarithmic or exponential scale, before the color interpolation is applied. Logarithmic mapping gives more detail to smaller values by expanding their range while compressing larger ones; exponential mapping does the opposite, expanding larger values while compressing smaller ones. It is also possible to bend space around a particular value, in effect applying a lens-like distortion to an interval, to better resolve the differences around that value. This can be beneficial for normally distributed values, or other distributions with curved shapes. Non-linear color mapping can be very useful when applied appropriately, but it can also lead to misleading interpretations when it is not.
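As a small illustration of the idea (my own sketch, using matplotlib's built-in normalizations on made-up values spanning several orders of magnitude):

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize, LogNorm

# Hypothetical values spanning several orders of magnitude
values = np.array([1, 5, 20, 120, 900, 4000, 25000, 180000])

linear = Normalize(vmin=values.min(), vmax=values.max())
logarithmic = LogNorm(vmin=values.min(), vmax=values.max())

# A linear mapping collapses almost all of the small values onto nearly
# the same color; a logarithmic mapping spreads each order of magnitude
# across its own share of the color range
print("linear:", np.round(np.asarray(linear(values)), 3))
print("log:   ", np.round(np.asarray(logarithmic(values)), 3))

# Either set of normalized values is then passed through a colormap:
colors = plt.cm.viridis(logarithmic(values))
```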

As Monmonier highlights in “How to Lie with Maps” [1], the choice of scaling and color scheme can manipulate perception, and designers of such visualizations have an ethical responsibility to ensure their design choices do not compromise the audience’s understanding of the data. This does not mean linear color scales are inherently the more accurate or undistorted choice. For instance, when visualizing percentage changes in a population ranging from -100% to +100% on a choropleth map, it might be more suitable to use a non-linear color map. This is because decreases and increases are not linearly proportional: a -50% change (halving the population) is equivalent in magnitude to a +100% change (doubling the population). A linear scale does not show this balance, making decreases look smaller than equivalent increases. As D3 creator Mike Bostock explains in a tweet [8], a diverging logarithmic scale ensures that changes in both directions are shown equally. However, since such a scale might be challenging for lay audiences to read, I think pairing numerical values like 0.5 and 2.0 with descriptive terms like “half” and “double” could make its interpretation more intuitive.

Figure 5. Two color scales for the same data: a linear percentage scale running from -91% (red) to +91% (green) in even intervals, and a logarithmic multiplier scale running from ×0.39 to ×2.54, which expresses the same changes as multipliers.
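A sketch of the idea behind the multiplier scale (my own reading of the suggestion in [8], not Bostock's code): map each percentage change to the base-2 logarithm of its multiplier before applying a diverging colormap, so that halving and doubling land symmetrically around zero.

```python
import numpy as np

def percent_change_to_log2(pct_change):
    """Map a percentage change (e.g. -50 or +100) to log2 of its multiplier,
    so that halving (-50%) and doubling (+100%) sit symmetrically around 0."""
    multiplier = 1 + np.asarray(pct_change, dtype=float) / 100
    return np.log2(multiplier)

changes = np.array([-75, -50, 0, 100, 300])
print(percent_change_to_log2(changes))
# [-2. -1.  0.  1.  2.] -- a diverging colormap applied to these values
# treats -50% and +100% as equal in magnitude
```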

Conveying the mechanism

These examples barely scratch the surface of what is possible with visualization. The virtual world offers limitless potential, unconstrained by the realities of our physical world. Designers possess the power to manipulate space and form, bounded only by their imaginations. As noted earlier, with such power comes a great responsibility to prioritize the pursuit of truth. In “Is Science Losing its Objectivity?” [3], Ziman writes: “scientists are united in ‘the pursuit of truth’. But some philosophers say that ‘truth’ is an illusion, whereas others say that it takes many forms….” This raises the question: if truth is subjective, an interpretation unique to every person, how can we define a unified path towards it? Everyone has their own truth, their own narrative of the world. Should a visualization only align with the audience’s existing truth, reinforcing their current worldview? Or should it aim to challenge their truth, reshaping their perception of reality? The latter inevitably involves manipulation, as it pushes the audience toward a narrative that conflicts with their existing truth. If a visualization inherently alters subjective realities, can it ever avoid introducing bias? Is attempting to change someone’s truth manipulative, or is it simply part of the process of teaching?

We must accept that engaging in science involves both producing and receiving new narratives. Our worldviews, our subjective truths, will inevitably change as a result of knowledge sharing. This is a fundamental aspect of learning and growth. What is important, however, is recognizing how our worldview is being changed, as it happens. Equally important is our responsibility to convey transparently and accurately to others how we are trying to shift their perception, so they can recognize how their worldview is being reshaped. Any distortion is ethically defensible in principle, as long as the audience clearly understands its underlying mechanisms and how it is altering their perception of the world. This aligns with the recommendations in the referenced studies: if a distortion feels ethically conflicting, rather than entirely excluding it, ensure its mechanisms are thoroughly and transparently conveyed. If, despite thorough explanation, the intended audience still cannot grasp its inner workings, then the ethical responsibility shifts to reevaluating whether the distortion is appropriate at all.

Ultimately, in knowledge sharing, what matters is the transparency of the medium itself. Whether it relies on abstraction, filtering, or distortion, the clarity of its mechanism determines its ethical integrity. When the mechanism is obscured, the result is manipulation: information transfer without transparency.

References

1. Monmonier, M. 
How to lie with maps 
University of Chicago Press (1991)

2. Leung, Y. K. & Apperley, M. D. 
A review and taxonomy of distortion-oriented presentation techniques
ACM Transactions on Computer-Human Interaction (TOCHI) 1, 126–160 (1994)

3. Ziman, J. 
Is science losing its objectivity? 
Nature 382, 751–754, ISSN 0028-0836 (Aug. 1996)

4. Munzner, T. 
Exploring large graphs in 3D hyperbolic space
IEEE computer graphics and applications 18, 18–23 (1998)

5. Cambridge Advanced Learner’s Dictionary and Thesaurus
Cambridge University Press (1999)

6. Chen, M., Silver, D., Winter, A. S., Singh, V. & Cornea, N. 
Spatial transfer functions: a unified approach to specifying deformation in volume modeling and animation in Proceedings of the 2003 Eurographics/IEEE TVCG Workshop on Volume graphics (2003), 35–44

7. Pandey, A. V., Rall, K., Satterthwaite, M. L., Nov, O. & Bertini, E. 
How deceptive are deceptive visualizations? An empirical analysis of common distortion techniques, in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (2015), 1469–1478

8. Bostock, M. 
Twitter Post: https://x.com/mbostock/status/991517711250305024 (2018)
Accessed: December 5, 2024. 

9. Ryan, W. H. & Evers, E. R. 
Graphs with logarithmic axes distort lay judgments
Behavioral Science & Policy 6, 13–23 (2020)

Exploring Open Source Data to Visualize 99-Cents Stores https://nightingaledvs.com/exploring-open-source-data-to-visualize-99-cents-stores/ Thu, 07 Dec 2023 19:19:35 +0000 https://dvsnightingstg.wpenginepowered.com/?p=19254 I used a business database from my public library, free data visualization tools, and art supplies to create a 3D map of 99-cents stores.


I’ve been personally and professionally interested in 99-cents stores for a long time. Growing up in New York City, I considered 99-cents stores part of my retail ecosystem. They were the place to pick up everything from duct tape to candles to random cleaning supplies. There’s a ubiquity to them (75% of Americans live within five miles of a Dollar General), but I didn’t know much about them. As part of an art project I started in 2021 with my collaborator Gloria Lau, I mapped 99-cents stores in New York City to visualize a kind of place that’s often invisible.

As an urban planner, I’ve spent many hours in GIS/QGIS trying to figure out the best way to display information about neighborhoods, land use, climate change impacts, the list goes on. There’s a formality to maps that are created out of necessity: They have to be legible to a variety of users regardless of the person’s level of data literacy or map savviness. Maps also need to be visually consistent. Whenever I made a land use map, my color palette was limited to the standard colors used by the New York City government. 

Working on this art project, I got to merge my urban planner/mapmaker brain with my art-making brain to think about visualization outside of traditional mapping. Part of what made this project possible was my ability to access hard-to-find datasets through the Brooklyn Public Library and through free data visualization tools like QGIS and resources offered by BetaNYC, a civic data organization based in NYC. 

Defining a 99-cents store

Unlike much of the United States, where discount stores are typically franchises, 99-cents stores in New York City are mostly independently owned businesses. As a result, the stores don’t follow a consistent naming convention, and a simple Google search doesn’t produce an irrefutable list. To map these stores, I had to create a working definition of these places to parse through all the different kinds of retail in the city. I defined 99-cents stores as businesses that market themselves as discount stores (e.g., “Midwood Discount,” “Discount Deals”) or that have “99 cents” or some variation explicitly in the name (e.g., “Dollar Tree,” “99 Cents and Up”). After a quick search of the North American Industry Classification System (NAICS) and Standard Industrial Classification (SIC) codes most commonly used for dollar stores, I used a business search engine through my library to find stores in the city.

With my rough list of 99-cents stores, I used a NYC batch geocoder tool from BetaNYC to spatially locate all the results from the search engine. I used Google Street View to spot-check addresses and confirm that the retail spaces were discount stores. When all was said and done, I had a list of about 1,300 stores based on 2021 data.

Translating data into art

The map provided a straightforward view of 99-cents stores in the city. It also revealed which neighborhoods had the highest concentration of stores. In New York City and nationwide, there’s a documented prevalence of discount stores in communities of color and communities that are in food deserts or food swamps, so much so that some communities have organized to stop their proliferation. Looking at the data, I started thinking of the hills and valleys of 99-cents stores across the boroughs and how I might represent them in 3D form. As part of a larger exhibit, my collaborator and I wanted to present items from 99-cents stores in such a way that visitors would look critically at objects they might not otherwise pay attention to.

Using materials sourced from my local discount store, I created a 99-cents store contour map. After converting my point data into a heat map, I used a contour line tool in QGIS to create a topographical-like map of 99-cents stores. Using the map as a template, I then cut out individual plastic elevations.

The final map was roughly 3 feet by 3 feet, using placemats for elevation, a vinyl carpet runner for the boroughs, and contact paper as a base layer. My goal in doing this project was to represent dollar stores in an unconventional way, but it also turned into a lesson on data storytelling. I didn’t have to present a perfect dataset, but rather share my findings in a way that might make a viewer curious about 99-cents stores in their neighborhood. The barrier to playing with this data was low thanks to my library and the open source tools that made my analysis possible. The project has made me more curious about the possibilities of blending data and art, and about ways to make opaque institutions or systems more transparent through art.


To learn more about 99-cents stores in NYC and nationwide:

  1. “Commodity City”: A documentary exploring China’s Yiwu International Trade Market, the world’s largest wholesale market and a major supplier for 99-cents stores.
  2. The New York Times: A 2017 article highlighting the stories of immigrant-owned 99-cents stores in New York City.
  3. “God’s Garage”: An essay covering the history and expansion of 99-cents stores in the United States.

Visualizing Emotions with Color Cubes https://nightingaledvs.com/visualizing-emotions-with-color-cubes/ Thu, 26 Oct 2023 15:43:11 +0000 https://dvsnightingstg.wpenginepowered.com/?p=18922 Using two different data sets, I leveraged color and 3-D structure to visualize the complexity and breadth of human emotion.

Emotions, like a wise compass, provide us with invaluable insights into our inner world. As my therapist often reminds me, they carry information that guides us. We need not be ruled by them, but ironically, ignoring them sentences us to their whims. By acknowledging and naming our emotions, we gain the power to choose our response. Emotions exist, whether we want them or not, playing a dominant role in our human lives.

For someone like me, emotions are as complex as a Gordian knot, but I’m certainly not a psychologist; I am a fellow human who experiences them in the typical human way. I’m learning to identify and embrace them, allowing them to shape my existence. It’s no easy task, but the rewards are immeasurable.

Emotions can be both overwhelming and vital. Without them, life would be devoid of color, a monotonous existence. When I think about emotions, it reminds me of color and temperature. This led me to explore how emotions are represented from the perspective of data science — the perspective I’m most familiar with.

Capturing emotions

My first question was: how do we capture something as complex as emotions? I searched the web in an attempt to find datasets on emotions. I thought that emotion could be detected from facial expressions, but also from text. I also thought that one data science task in particular shares some similarities with emotion detection: sentiment analysis. Sentiment analysis could be seen as an extremely coarse-grained emotion detection task with only three categories: neutral, positive, and negative.

After perusing the popular data science platform Kaggle, I unearthed an intriguing dataset, “Emotions Dataset for NLP,” along with an accompanying article detailing its collection. The task was to classify sentences by six emotions: sadness, anger, surprise, fear, love, and joy, much like the adorable world of Pixar’s animation “Inside Out.” Although these few emotions also seemed very coarse-grained, I decided to give it a try and see how they look.

The journey begins

I became eager to “visualize these emotions” and gain a deeper understanding by organizing the data based on multidimensional representations of their corresponding textual descriptions. Thus began my journey across the emotion space.

How does one truly delve into the depths of this dataset? I was curious about the authors’ decision to select only six emotions out of the countless possibilities. Upon loading the data into a DataFrame, I set out to truly grasp what lay within. The dataset presented me with two primary aspects: textual descriptions (coming from tweets) and corresponding labels reflecting emotions. For instance, I stumbled upon an entry stating, “I feel strong and good overall,” labeled as “joy.”

Rather than embarking on a conventional classification task, where I would train a model to predict emotions based on labeled sentences, I sought to explore the dataset itself and its representation. Leveraging a language model with universal sentence embedding, I transformed each sentence into a long vector of numbers, positioning them within a latent space according to the language model. Employing a dimensionality reduction technique, I extracted the three most informative components from the extensive vector and plotted them. Admittedly, it didn’t offer any groundbreaking revelations. Dimensionality reduction techniques are commonly employed to gain insights into datasets, but even with the added labels, I wasn’t convinced that discernible patterns were emerging.

One lovely tool I really enjoy is TensorBoard, which lets the user visualize the latent space of a model. With its help, I loaded the obtained vector-based representations of the descriptions (also called “embeddings”) into TensorBoard and applied UMAP dimensionality reduction out of the box. You can see it below.

Visualization of the latent space reduced with the help of UMAP of the Emotions Dataset for NLP with Tensorboard.

Creating emotion cubes

I assigned labels and hover-over descriptions to each data point. Upon closer inspection, I noticed numerous clusters of points huddled closely together. Intriguingly, instead of running a cluster analysis, I became captivated by these individual data points and their neighborhoods. A thought struck me—if I could encapsulate neighboring points within cubes, it might yield fascinating insights when analyzed cube by cube, or at least interesting mixtures of “color-emotions.” I wondered how to cover the neighboring points in cubes. Because I was working in a three-dimensional space, I imagined it as a huge storage room that I would fill with small cardboard boxes, something like in the picture below. In each box we could place emotions that lie close together. Of course, there is no gravity to keep the full boxes on the ground, but the idea seemed interesting enough to try out.

Rough sketch of a 3-D space with “emotion boxes” filling part of it.

Thus, I divided the latent space into small, equally sized cubes and assigned colors to represent the emotions they contained. With a touch of transparency, the cubes came to life — how exhilarating it was. It took me a while to realize that the majority of cubes filling the space were empty and were therefore covering the emotion cubes, so I decided to remove the “empty boxes” and leave only boxes with emotions.

When all the beige cubes of empty space are removed, the remaining emotion cubes are more visible.
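A minimal sketch of the binning step (the variable names and toy data below are hypothetical, standing in for the reduced embeddings): divide the 3-D latent space into equally sized cubes and keep only the cubes that actually contain points.

```python
import numpy as np
from collections import defaultdict

# Toy stand-ins for the UMAP-reduced embeddings and their emotion labels
rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 3))
labels = rng.choice(["joy", "sadness", "anger", "fear", "love", "surprise"], size=1000)

cube_size = 0.5
# Index each point by the integer coordinates of the cube it falls into
cube_index = np.floor(points / cube_size).astype(int)

cubes = defaultdict(list)
for idx, label in zip(map(tuple, cube_index), labels):
    cubes[idx].append(label)

# Only cubes that contain at least one point exist here --
# the "empty boxes" never appear and so never hide the others
print(len(cubes), "non-empty cubes")
```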

Initially, I contemplated coloring each cube with the most dominant emotion it contained. However, as I immersed myself in the project, I realized that certain emotions are more intricate, often comprising a mix of several emotions. I thought, why don’t we look at the six labels as basic building blocks and see what will happen when we blend the basic emotion colors together? Perhaps this blend could offer us unique insights into the emotional landscape. Consequently, I experimented with blending colors in proportion to the composition of emotions within each cube, settling on translating colors into CIELAB space and mixing the dimensions there.
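A rough sketch of that blending step, assuming a hypothetical base color per emotion and using scikit-image for the RGB/CIELAB conversions (this is my reconstruction of the idea, not the project's actual code):

```python
import numpy as np
from skimage import color

# Hypothetical base RGB color per emotion (values in [0, 1])
base_rgb = {
    "sadness": (0.2, 0.3, 0.8), "anger": (0.8, 0.1, 0.1), "joy": (1.0, 0.8, 0.1),
    "love": (0.9, 0.3, 0.6), "fear": (0.4, 0.2, 0.5), "surprise": (0.1, 0.8, 0.7),
}

def blend_emotions(counts):
    """Blend the base colors in CIELAB space, weighted by how often each
    emotion occurs in a cube, e.g. counts = {"joy": 5, "anger": 2}."""
    emotions = list(counts)
    weights = np.array([counts[e] for e in emotions], dtype=float)
    weights /= weights.sum()

    rgb = np.array([base_rgb[e] for e in emotions]).reshape(-1, 1, 3)
    lab = color.rgb2lab(rgb)                       # to perceptual CIELAB space
    mixed = (weights[:, None, None] * lab).sum(axis=0, keepdims=True)
    return color.lab2rgb(mixed).ravel()            # back to RGB for display

print(blend_emotions({"joy": 5, "anger": 2, "sadness": 1}))
```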

And there we stood, surrounded by multiple little cubes brimming with colors—anger, sadness, joy, love, and everything in between. Inside these cubes, I placed the corresponding data points, allowing the chart to rotate and facilitating exploration of each cube and its contents. You can access it here.

I searched for other data as well, and came across the GoEmotions dataset. This dataset takes a more nuanced look at emotions, listing 27 of them plus a neutral state. We can see below how mixing the colors by emotion makes each cube like no other.

A more nuanced take on emotions, based on the GoEmotions dataset.

Going a step further, I thought: what if we use these emotion datasets and the basic building-block emotions as the initialization, let the data mix the emotion-colors in the cubes, and extract an emotionally inspired color palette at the end? See below for the extracted colors.

Colors extracted with the NLP Emotions dataset.
Colors extracted with the GoEmotions dataset.

After thinking about the color palette, I thought about the sound of emotions; one way to expand upon this project would be to add a layer of sonification to the cubes.

Challenges and reflections

Working on this emotion problem made me wonder about all the different aspects of capturing the data and especially how simplified our models are. I explored individual data points (as “above all show the data” following Edward Tufte) and sometimes wondered how someone labeled them with such emotion. Also, how can you capture such complexity by flattening the emotion to just one sentence, when you cannot hear the voice or perceive the “emotional state” someone was in while uttering the sentence or a sound? I suppose George Box was right again when he said that “all models are wrong, but some are useful” and we should always have that in mind while looking at models.

Visualizing ChatGPT’s Worldview https://nightingaledvs.com/visualizing-chatgpts-worldview/ Tue, 19 Sep 2023 12:35:40 +0000 https://dvsnightingstg.wpenginepowered.com/?p=18651 Using prompting and data visualization to make sense of Machine Learning systems.

How will “prompting” change the way we experience the world? The Artificial Worldviews project is looking to find out. This research initiative explores the boundaries between artificial intelligence and society. Recently, Artificial Worldviews researcher Kim Albrecht asked GPT-3.5 about its knowledge of the world in 1,764 prompts and mapped out the results.

Check out our Project Page for more information.


The advent of Large Language Models (LLMs) has revolutionized natural language processing and understanding. Over the past years, these models have achieved remarkable success in various language-related tasks, a feat that was unthinkable before. After its launch, ChatGPT quickly became the fastest-growing app in the history of web applications. But as these systems become common tools for generating content or finding information—from research and business to greeting cards—it is crucial to investigate the worldviews of these systems. Every media revolution changes how humans relate to one another; LLMs will have a vast impact on human communication. How will systems such as ChatGPT influence the ideas, concepts, and writing styles over the next decade?

Using ChatGPT to Generate Data

To grasp the situation, I started by methodically requesting data from the underlying API of ChatGPT (GPT-3.5 Turbo) about its own knowledge. The OpenAI Application Programming Interface (API) structures calls into two messages: the user message and the system message. While the user message is similar to the text you enter into the front end of ChatGPT, the system message helps set the behavior of the assistant.

 For the project, I designed the following system message:

You are ChatGPT, a mighty Large Language Model that holds knowledge about everything in the world and was trained on a massive corpus of text data, around 570GB of datasets, including web pages, books, and other sources.

The initial user message was the following:

Create a dataset in table format about the categories of all the knowledge you have. The table should contain at least 30 rows and 10 columns. Pick the dimensions as they make the most sense to you.

I sent these requests six times with six different “temperatures”: 0, 0.2, 0.4, 0.6, 0.8, and 1. The temperature determines the randomness of the responses. (A temperature of 0 means the responses will be the same for a given prompt, while a temperature of 1 means the responses can vary wildly.) The resulting data file from the six API calls consisted of 31 fields and 425 subfields of knowledge.
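A minimal sketch of these calls, using the 2023-era (pre-v1) openai Python library; the messages are the ones quoted above, while the loop structure and variable names are my reconstruction rather than the project's actual code:

```python
import openai  # pre-v1.0 openai Python library

openai.api_key = "YOUR_API_KEY"  # placeholder

SYSTEM_MESSAGE = (
    "You are ChatGPT, a mighty Large Language Model that holds knowledge about "
    "everything in the world and was trained on a massive corpus of text data, "
    "around 570GB of datasets, including web pages, books, and other sources."
)

USER_MESSAGE = (
    "Create a dataset in table format about the categories of all the knowledge "
    "you have. The table should contain at least 30 rows and 10 columns. "
    "Pick the dimensions as they make the most sense to you."
)

responses = []
for temperature in [0, 0.2, 0.4, 0.6, 0.8, 1]:
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=temperature,  # 0 = deterministic, 1 = highly variable
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": USER_MESSAGE},
        ],
    )
    responses.append(completion.choices[0].message["content"])
```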

From this initial dataset, I designed a recursive algorithm that requested data about fields of knowledge and their subfields, including the humans, objects, places, and artifacts within these categorical systems. 

The core dataset was requested from the OpenAI API in 1,764 requests over the span of three days. Humans and objects were requested separately in all fields and subfields (425). Each of those 850 calls was made twice: once with a temperature of 0 and once with a temperature of 0.5. All requests in the visualization were made to the GPT-3.5-Turbo API. The number of returned items per request varied between five (‘Linguistics’ and ‘Travel Budget’) and 40 (‘Mythology’) rows of data. Due to this inconsistency, some fields hold more items than others. The user message was always the same, asking for the most important humans, objects, places, and artifacts for each field and subfield. So, for example, to get the most important “humans” in the field of “Art” and the subfield of “Film,” the message looked like this:

List the most important humans in ‘Arts’ in the field of ‘Film.’ List their name, kind, category, description, related things, and importance (0 – 100) as a table.

The final dataset contained over 18,000 objects, items, humans, and places. Reading through it feels like something between an encyclopedia, a strange short story by Jorge Luis Borges, and a Dada artwork. It is a vast dataset containing entries from ‘Hair follicles’ to ‘Demonic Possession.’ Trivia is mixed with essential figures and moments of history; facts and fictions intermingle. I am still wondering whether GPT ‘hallucinated’ anything within the dataset. Thus far I have not found an entry that does not reference a concept that exists in our world, but there might be one.

I understand the data as follows: the generated data does not represent an unbiased picture of the knowledge inherent in GPT-3. Instead, it is a confluence of three forces: first, a representation of how the LLM handles the request; second, a perspective on the underlying textual training data; and third, a reflection of the political sets and settings embedded within the artificial neural network.

Visualizing the ‘Knowledge’ of ChatGPT

To visualize the data, I tried various methods, all based on the same two-fold process: turning text into numbers (text-to-vector) and reducing these numbers into two dimensions that can be visually represented on an x-axis and a y-axis (t-SNE, LDA, and UMAP). The results of these experiments were rather unsuccessful.
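For reference, the two-fold process looks roughly like this (a sketch with sentence-transformers and umap-learn on a few made-up descriptions, not the project's actual pipeline):

```python
from sentence_transformers import SentenceTransformer
import umap

# A few made-up item descriptions standing in for the 18,000-item dataset
descriptions = [
    "Rachel Carson, marine biologist and conservationist",
    "The Great Wall of China, ancient fortification",
    "Photosynthesis, the process by which plants convert light into energy",
    "Jane Goodall, primatologist and anthropologist",
    "The Rosetta Stone, key to deciphering Egyptian hieroglyphs",
    "Quantum entanglement, a phenomenon in quantum physics",
]

# 1. Text to vector: embed each description in a high-dimensional space
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(descriptions)                  # shape (n_items, 384)

# 2. Reduce to two dimensions for plotting on an x/y plane
xy = umap.UMAP(n_components=2, n_neighbors=3).fit_transform(embeddings)
print(xy.shape)                                          # (n_items, 2)
```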

In a second approach, I calculated network similarities. The dataset consists of four layers. The first two layers are the field (31) and subfield (425) categories of GPT’s knowledge. The third layer consists of 7,880 items representing the core dataset of the project, including the people, objects, places, etc. that GPT-3.5 named in the API requests, each with a full description. The fourth layer consists of 24,416 items that GPT-3.5 named as related to the core items of the third layer.

To calculate a network structure from the data, I connected fields to subfields, and linked objects and humans by co-mentions in multiple fields. Thus, in the resulting map, objects and humans cluster together by similarity.

One intriguing problem was the positioning of text within the visualization: how to find the best position for the labels of fields and subfields? After various trials, I came up with the idea of using regression analysis to position the text. Regression analysis is a statistical method that shows the relationship between two or more variables. For each field and each subfield, I calculated a polynomial regression curve and positioned the text along this curve.

Due to the network structure of the visualization, some fields and subfields had outliers that distorted the regression too much. To account for this, I removed the 5% most outlying points from each of the calculations. Since the x and y positions of the points do not represent a relationship between a dependent and an independent variable, conducting a regression analysis is not particularly meaningful statistically. However, for positioning the text and displaying the labels’ positional relationship to one another, the technique worked well.
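A sketch of the label-placement trick under those assumptions (numpy only; the function name and the specific outlier rule are my own simplification):

```python
import numpy as np

def label_curve(points, degree=2, trim=0.05):
    """Fit a polynomial regression through a field's node positions, after
    dropping the most outlying fraction of points, and return a curve along
    which the field's label can be laid out."""
    pts = np.asarray(points, dtype=float)

    # Drop the `trim` fraction of points farthest from the centroid
    dist = np.linalg.norm(pts - pts.mean(axis=0), axis=1)
    keep = dist <= np.quantile(dist, 1 - trim)
    x, y = pts[keep, 0], pts[keep, 1]

    # Polynomial regression y = f(x), used purely for label placement
    coeffs = np.polyfit(x, y, deg=degree)
    xs = np.linspace(x.min(), x.max(), 100)
    return xs, np.polyval(coeffs, xs)

# Hypothetical node positions of one subfield's items
rng = np.random.default_rng(1)
pts = rng.normal(loc=[0.0, 0.0], scale=[3.0, 1.0], size=(200, 2))
xs, ys = label_curve(pts)
```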

The unforeseen finding

My research questions have been manifold and have only deepened throughout the process of working with the data and the visualization. The project prompts a deeper inquiry into the nature of artificial intelligence, its boundaries, and its political constraints.

First, I found myself intrigued by the possibility of probing this novel method as a means to understand AI systems. What can we learn from the iterative and methodical requesting of data from large language models? Is it a mirror reflecting our human intellect or an entity with its own inherent logic? Second, I am interested in questions that pertain to the dataset itself: What are the biases of the system? Are there fields that stand overrepresented or underrepresented, and what does that signify about our collective online text corpus? How diverse is the dataset, and what can that diversity teach us about the breadth and limitations of machine learning? Investigating all these questions within this article would lead to a book-length piece, so I am constraining myself to the one finding that intrigued me most.

One of the most striking features of the dataset emerges from simply counting how many times GPT named each thing. The bar chart shows the most frequently named items in the dataset.

First of all, the entire list of the most named things consists only of humans. Secondly, the list is led by Rachel Carson and Jane Goodall. Rachel Carson is known for her book Silent Spring (1962) and for advancing the global environmental movement. Jane Goodall is considered the world’s foremost expert on chimpanzees. An American marine biologist and an English primatologist and anthropologist are the two most named figures within the project.

In comparison, the Pantheon project ranks people by, among other things, the number of Wikipedia language editions and the count of article clicks. In this ranking, the first woman is Mary, mother of Jesus, at rank 33 (accessed on August 7, 2023). Muhammad, Isaac Newton, and Jesus are the top-ranked figures within the Pantheon project. To make sense of these perhaps counterintuitive results, it is important to note how the data was generated. Fields and subfields were requested through 1,764 API calls. Rachel Carson was listed 73 times within those 1,764 calls. For a person, object, or place to be named frequently, GPT-3.5 needs to name it in as many combinations of categories and subcategories as possible. Thus, a high ranking results from spreading across many categorical systems.

So far we do not know why and how a marine biologist, writer, and conservationist (Rachel Carson), the world’s foremost expert on chimpanzees (Jane Goodall) as well as a Kenyan social, environmental, and political activist (Wangari Maathai) became so central in the map. Leonardo da Vinci, Charles Darwin, Albert Einstein, Alan Turing, Elon Musk, Galileo Galilei, Karl Marx, William Shakespeare, Winston Churchill, Carl Sagan, Sigmund Freud, Mahatma Gandhi, and Nelson Mandela are all named less frequently than these three women. 

The question becomes: Are Rachel Carson and Jane Goodall individuals whose research spreads especially well? Research that transcends fields and categories? Or is something else happening here? There are a number of possible explanations. OpenAI may set certain parameters that lead to these results. Prompt engineers could be pushing certain perspectives to become more visible. Or maybe GPT-3 “cares” a lot about the planet and the environment. At this point, this is hard to say and would need a much deeper investigation than these preliminary findings.

  1. The dataset can be viewed and downloaded here. Anyone is welcome to use the dataset. Feel free to reach out to Kim Albrecht with questions, comments, analysis, or visualizations based on the dataset.

Artificial Worldviews is a project by Kim Albrecht in collaboration with metaLAB (at) Harvard & FU Berlin, and the Film University Babelsberg KONRAD WOLF. The project is part of a larger initiative researching the boundaries between artificial intelligence and society.

Feel free to reach out to Kim Albrecht with questions or comments kim@metalab.harvard.edu.

Decoding the Manhattan Project’s Network: Unveiling Science, Collaboration, and Human Legacy https://nightingaledvs.com/network-diagram-the-manhattan-projects-network/ Tue, 12 Sep 2023 14:58:08 +0000 https://dvsnightingstg.wpenginepowered.com/?p=18536 A data viz that unveils the science and human legacy of the Manhattan Project, one of the largest ever scientific collaborations.

The Manhattan Project was one of the largest scientific collaborations ever undertaken. It operated thanks to a complex social network of extraordinary minds and it became undoubtedly one of the most remarkable intellectual efforts of human history. It also had devastating consequences during and after the atomic bombings of Hiroshima and Nagasaki. Despite the loss of hundreds of thousands of human lives during the bombing and the subsequent events, the scientific journey itself stands as a testament to human achievement, as highlighted in Christopher Nolan’s film portrayal of Oppenheimer. 

The scientific literature on collaboration, particularly on the role of network connections in achieving success, is robust and has been further enriched by the current data boom. This wealth of data, represented for instance by millions of scientific papers, is exemplified in works such as “The Science of Science” by D. Wang and A. L. Barabási [1]. Utilizing network analysis to uncover the intricate connections within the Manhattan Project aligns with my perspective as a physicist turned network scientist. Without further ado, here’s how I mapped the Manhattan Project into data and used that data to create a network visualization of this historically significant collaborative project.

Collecting data

As with many data science projects, the first question revolved around data selection. While scientific publication data might seem logical, given the project’s scientific nature, this approach proved inadequate. The main reason was twofold: first, some of the most important documents and papers could still be classified; and second, not everyone involved was active in science, as the operation was also heavily intertwined with politics and the military. Thus, resorting to collective wisdom, my focus shifted to Wikipedia, a global crowdsourced encyclopedia and a potential data source. Wikipedia offers a list of notable personnel connected to the project [2], encompassing more than 400 contributors from various fields. I used a straightforward web-scraping technique to collect data from Wikipedia—a total of 452 usable profiles (a rough sketch of this step appears at the end of this section). I then manually categorized each person based on occupation, leading to the distribution outlined in the following table.

The list, not entirely surprisingly, is topped by physicists, followed by chemists and engineers. Exploring the scientists at the forefront of the Project will have to wait, though; let’s first take the stories from the “Other” category. This group collects contributors whose primary occupations appeared infrequently and seemed unrelated to a scientific project focused on weaponry development. Among these unconventional contributors are Wolfrid Rudyerd Boulton, an American ornithologist who happened to become responsible for monitoring the supply of uranium ore from the Belgian Congo, and Edith Warner, a tea room owner in Los Alamos whose role was said to have profoundly impacted researchers’ morale.

Some other notable “other” figures include Charlotte Serber, a journalist, statistician, librarian, and the sole female laboratory group leader in Los Alamos. Ben Porter defies categorization, too, embracing roles as an artist, writer, publisher, performer, and physicist—later exhibiting work at New York’s Museum of Modern Art. The selection concludes with James Edward Westcott, a notable Manhattan Project photographer, and Donald Lindley Harvey, a professional basketball player turned Army member contributing to the project.
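The Wikipedia collection step described at the start of this section can be sketched roughly as follows; the category page URL comes from reference [2], while the CSS selector and the lack of pagination handling are simplifying assumptions on my part:

```python
import requests
from bs4 import BeautifulSoup

# Category page listing Manhattan Project personnel (reference [2])
URL = "https://en.wikipedia.org/w/index.php?title=Category:Manhattan_Project_people"

html = requests.get(URL, headers={"User-Agent": "research-scraper"}).text
soup = BeautifulSoup(html, "html.parser")

# On Wikipedia category pages, the member articles are links inside the
# "Pages in category" section (id="mw-pages"); this grabs their titles
people = [a.get_text() for a in soup.select("#mw-pages a") if a.get("title")]
print(len(people), people[:5])
```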

Constructing the network

With the data in hand, I turned to network science [3], the science of connections, which is perfect for elegantly deciphering complex structures such as the Manhattan Project’s collaboration patterns. Each network comprises nodes (entities) and links (relationships) that weave the intricate social fabric of the collaborating people. In this context, each node symbolizes a Manhattan Project contributor, with links forming between individuals whose Wikipedia pages reference one another. The number of shared references determines the link’s strength. Employing this straightforward framework, I arrived at a network of 316 individuals connected by 1,099 ties of various strengths.

A network diagram, showing bubbles and connecting lines on a black background. The bubbles, or nodes, are different sizes, as are the connecting lines. All the lines and nodes are a gray color. 50 of the largest nodes are labeled with a name.
Figure 1. The collaboration network behind the Manhattan Project. Each node represents a contributor, where two nodes are linked if their Wikipedia pages reference each other. The top 50 nodes with the largest number of connections are labeled.
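A toy sketch of the construction with networkx; the reference sets below are invented, and the exact weighting of mutual mentions is my interpretation of the description above:

```python
import networkx as nx
from itertools import combinations

# Invented example: for each contributor, the set of other contributors
# their Wikipedia page references
references = {
    "Enrico Fermi": {"Leo Szilard", "Niels Bohr", "Edward Teller"},
    "Leo Szilard": {"Enrico Fermi", "Edward Teller"},
    "Niels Bohr": {"Enrico Fermi"},
    "Edward Teller": {"Leo Szilard"},
}

G = nx.Graph()
G.add_nodes_from(references)

for a, b in combinations(references, 2):
    # Link two people if their pages reference one another; the weight
    # counts the references shared between the pair
    weight = (b in references[a]) + (a in references[b])
    if weight:
        G.add_edge(a, b, weight=weight)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "links")
```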

Infusing color into insight

The next phase enriches the network visualization by introducing color—each hue representing a distinct network community, or cluster. How these communities are defined hinges on the methodology, but the general premise remains: communities consist of nodes with a higher density of internal links than external ones [4, 5]. In other words, nodes mostly linked to each other—as opposed to the rest of the network—belong to one community. The resulting visual, presented in Figure 2, uncovers how contributors organize into closely connected clusters within the expansive Manhattan Project. In this figure, each color encodes a different community.

The collaboration network behind the Manhattan Project-- same as the grayscale version, but where each node is now colored based on the network community it belongs to. The largest nodes belong to Enrico Fermi, Niels Bohr, J. Robert Oppenheimer, Ernest Lawrence, Arthur Compton, Leo Szilard, Eugene Wigner, Edward Teller, Hans Bethe, Richard Feynman and Robert Bacher.
Figure 2. The collaboration network behind the Manhattan Project shown in Figure 1, where each node is colored based on the network community it belongs to.
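A small self-contained sketch of that coloring step, using the Louvain method of reference [5] as implemented in networkx; the toy graph below stands in for the real 316-node network:

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# A tiny stand-in for the weighted collaboration graph described above
G = nx.Graph()
G.add_weighted_edges_from([
    ("Fermi", "Szilard", 2), ("Fermi", "Anderson", 1), ("Szilard", "Anderson", 1),
    ("Bohr", "Bloch", 2), ("Bohr", "Franck", 1), ("Bloch", "Franck", 1),
    ("Fermi", "Bohr", 1),   # a weak link between the two groups
])

# Louvain community detection groups nodes that are more densely linked
# to each other than to the rest of the network
communities = louvain_communities(G, weight="weight", seed=42)

# Each node gets a community id, which the visualization maps to a color
membership = {node: i for i, group in enumerate(communities) for node in group}
print(membership)
```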

Deciphering the network’s narrative

With the vibrant visualization in Figure 2, we are ready to read the collaboration network. Key players in modern physics pop out immediately, including Nobel laureates Arthur Compton, Enrico Fermi, Niels Bohr, and Ernest Lawrence, alongside geniuses like J. Robert Oppenheimer and Edward Teller. Yet, there is much more to the story and the patterns behind the connections than just a handful of hubs.

At the core of this network diagram lies the red community, centered on the legendary Niels Bohr. Here, Bohr’s connections reveal his instrumental role in supporting refugee scientists during World War II who later joined the Project, including people like Felix Bloch, James Franck, and George Placzek, all marked in red. Adjacent to Bohr’s realm resides a green cluster, highlighted by the Italian physicist Enrico Fermi. Fermi, together with collaborators like Anderson, Szilárd, Compton, and Zinn, reached the milestone of the first self-sustaining nuclear chain reaction using uranium and gave birth to the first nuclear reactor, the Chicago Pile-1.

A close-up of the network diagram.
Figure 3. A close-up of the collaboration network behind the Manhattan Project colored by network communities shown in Figure 2, where each node is labeled.

While Eugene Wigner was most famous for his contribution to Chicago Pile-1, his links tie him closer to the purple community, which is scattered around the network. Wigner can be seen prominently in the upper-right corner. This more decentralized community, with Oppenheimer as its key figure, also includes the famous mathematician John von Neumann, shown in purple in the top-center part of Figure 3. (He, along with Wigner, was unfortunately left out of Nolan’s blockbuster movie.) In purple, we also see several other leading scientists, such as James Chadwick in the bottom-center, who led the British team on the Project; Robert Wilson, right next to Oppenheimer, who became the head of its Cyclotron Group; and the American physicist Robert Serber, directly above Oppenheimer, who created the code names for all three bomb design projects, such as “Little Boy” and “Fat Man.” Finally, a few words about the gray cluster, which turned out to be the Theoretical Division, with stars like Edward Teller in the center, and Nobel laureates Richard Feynman (my personal favorite scientist) in the top left and Hans Bethe in the center.

One last observation, on a personal note: at first sight, the connections between the Hungarian immigrant “Martians” [6] (Teller, Wigner, Szilard, and von Neumann) were hard to spot, despite their foundational role in the dawn of the atomic era and their countless joint projects. However, once I highlighted them on the network, my expectations were quickly confirmed. They are all closely, though not exclusively, linked, meaning that they were also very well embedded in the American scientific community at the time. This is probably best illustrated by the so-called Einstein–Szilard letter, written by Szilard in consultation with Teller and Wigner, and ultimately signed by Einstein and sent to President Roosevelt. A fun fact about this letter: in those days, Einstein was spending his vacation at the beach, so Szilard visited him right there. And since Szilard didn’t have a driver’s license, Teller drove him [7].

A version of the network diagram, mostly in black and white. A few nodes are in pink, showing the group of so-called Martians (Edward Teller, Eugene Wigner, Leo Szilard, and John von Neumann). The bubbles, or nodes, are different sizes, as are the connecting lines. In this particular group, Edward Teller's node is largest. The nodes in this version are not labeled, except for the few in pink.
Figure 4. A variant of Figure 2 highlighting the Martians – Edward Teller, Eugene Wigner, Leo Szilard, and John von Neumann.

Closing

Beyond the pages of history, the project embodies the convergence of human endeavor—distinguished minds across varied disciplines united for a common goal. This analysis sheds some light on the complex patterns of collaboration and joint efforts that allowed such great minds to connect, work in teams, and succeed at such an enormous scale. Additionally, the way I built this network illustrates how network science can be applied to nearly any social system, quantitatively capturing the invisible relations, and helping to interpret the hidden patterns underneath.

Disclaimer

Several parts of this text were upgraded by AI tools, namely, Grammarly and ChatGPT 3.5, while the whole text was initially drafted and later updated by the human author.


This article was edited by Kathryn Hurchla.


References

[1] The Science of Science, Dashun Wang, Albert-László Barabási, Cambridge University Press, 2021
[2] https://en.wikipedia.org/w/index.php?title=Category:Manhattan_Project_people
[3] Network Science, Albert-László Barabási, Cambridge University Press, 2015.
[4] A Network Map of The Witcher, Milan Janosov, https://dvsnightingale.wpenginepowered.com/a-network-map-of-the-witcher/
[5] Blondel, Vincent D., et al. “Fast unfolding of communities in large networks.” Journal of statistical mechanics: theory and experiment 2008
[6] https://en.wikipedia.org/wiki/The_Martians_(scientists)
[7] Marx György: The voice of the Martians

The Best Day… To Buy a Taylor Swift Ticket https://nightingaledvs.com/the-best-day-to-buy-a-taylor-swift-ticket/ Tue, 08 Aug 2023 12:40:01 +0000 https://dvsnightingstg.wpenginepowered.com/?p=18145 A Taylor Swift fan with little hope of buying a concert ticket used her coding and data viz skills to make her Wildest Dreams come true.

When presale tickets for Taylor Swift’s Eras tour were released in November 2022, Ticketmaster’s website was woefully overwhelmed. The site crashed, bots snatched up tickets, and millions of Taylor Swift fans, after waiting hours in a queue, were left empty-handed. As the digital dust settled, the bad news only continued. Not only did Ticketmaster cancel the general sale (due to dwindling ticket inventory), but the only available tickets were listed for up to 20 times their original price on resale markets.

For context, I have been listening to Taylor Swift since I was 12. Over the years, her music has been the soundtrack to my heartbreak, my happiness, and my growth into womanhood. I have cried, laughed, and belted out to all her songs. And if I don’t sound like a truly mad Swiftie by now, I can say with confidence that seeing her Reputation tour alongside two of my best friends was the best night of my life.

Suffice to say, I could not fathom not seeing her Eras tour.

Looking back on it now, I realize (as ridiculous as it sounds) that I passed through something like the five phases of grief in my search for Taylor Swift tickets. Denial and anger after the initial Ticketmaster fiasco; bargaining as I scoured Facebook and Twitter for resale tickets; depression when I realized there were millions of people just like me, many of whom were being scammed; and, finally, acceptance when I resigned myself to buying marked-up tickets on a reputable site like SeatGeek or StubHub.

At this point it was March, and I was looking to buy tickets for the show nearest to me (MetLife stadium in late May). My only remaining question: was there an optimal time to buy tickets? Was it now? Would tickets only become more expensive? Or was there an intelligible pattern to decode? A reasonable way to buy unreasonably priced tickets?

Ticketmaster, Look what you made me do

To answer these questions, I consulted my inner dataviz engineer. After realizing that manually checking prices on StubHub and SeatGeek was unsustainable, I began doing research on their APIs. SeatGeek, compared to StubHub, had more documentation and sample code available online to access their API, which provided aggregated pricing metrics for each show.

So, for example, on March 22nd I started pulling the average, median, and lowest prices of all SeatGeek listings for the April 13th show in Tampa. By repeating this until the day of the concert, I would be able to see trends in day-to-day ticket pricing, and not only for Tampa, but for every city and every date on the Eras tour. Initially, I manually added each day’s data to this ongoing dataset, but, for obvious reasons, I soon wrote Python code that grabbed the day’s data from SeatGeek and wrote it to a GitHub repository (thanks ChatGPT!).

A screenshot of Python code that grabs information from the API on Taylor Swift concerts, including date, state, city, average price, lowest price, highest price, visible listing count, median price, and other metrics.
Sample data pulled on March 22 from SeatGeek’s API showing pricing for the April 13 show in Tampa.
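A rough sketch of such a daily pull; the endpoint, parameters, and stats field names below follow SeatGeek's public API documentation as I understand it and the metrics shown in the screenshot, but they are assumptions rather than the author's actual script:

```python
import datetime
import requests

CLIENT_ID = "YOUR_SEATGEEK_CLIENT_ID"   # placeholder

# Query the public events endpoint for Eras tour shows
resp = requests.get(
    "https://api.seatgeek.com/2/events",
    params={
        "performers.slug": "taylor-swift",
        "per_page": 100,
        "client_id": CLIENT_ID,
    },
)
resp.raise_for_status()

today = datetime.date.today().isoformat()
rows = []
for event in resp.json().get("events", []):
    stats = event.get("stats", {})
    rows.append({
        "pulled_on": today,
        "show_date": event.get("datetime_local"),
        "city": event.get("venue", {}).get("city"),
        "average_price": stats.get("average_price"),
        "median_price": stats.get("median_price"),
        "lowest_price": stats.get("lowest_price"),
        "highest_price": stats.get("highest_price"),
        "listing_count": stats.get("listing_count"),
    })
# rows would then be appended to the ongoing CSV in the GitHub repository
```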

Now here it was: the moment when I’d truly use my dataviz skills for a noble cause. I would visualize this data to see, at a moment’s glance, trends in pricing, and to determine the exact moment when I should buy my tickets. 

What happened next is what I would consider a developer’s dream–the first version of the viz became (more or less) the final version. Creating a data visualization typically involves multiple cycles of design and development, spurred by user testing, to ensure that the final product meets the audience’s needs. Throughout a project I usually have a running list of to-do’s and bug fixes. And this, of course, is as it should be.

But this pet project, arguably the least intuitive and worst visualization I’ve made so far, had a very particular audience with very particular needs (It’s me, hi, I’m the user, it’s me!). With limited time, I did not fuss over clear axes or helpful explanations. I did not fret over the ugly UI, the lack of a mobile-friendly design, or (relatively) non-breaking bugs. All the extra care I’d usually apply to making my viz universal (no doubt the crux of our profession) was put into servicing its basic functionality. This was the first time I’d created such a simple and intimate project, and with that came a liberating joy.

A very crude bar chart with no labels or informative markers. There are many pop-ups on the bars showing the date, median price, listings, and city/state, but they all overlap each other.
Bug when hovering on bars while showing a date with missing future data. 

The result … Ready for it?

So how did I use this viz? Now that I’m writing for a larger audience, a longer explanation seems due.

Annotated visualization, showing pricing and listing data for Taylor Swift tickets on Wed. June 7. The data show the number of listings and the ticket price as bars. The x-axis is time and the bar colors distinguish past concerts from future concerts.
Annotated visualization, showing pricing and listing data on Wed. June 7.

In the image above, each set of positive and negative bars represents a show on a particular date. The horizontal axis represents time, the positive vertical axis represents the selected ticket pricing metric (average, median, lowest, etc.), and the negative vertical axis represents the number of SeatGeek listings for that show’s date. Shows typically take place Friday to Sunday. Each bar’s height represents the pricing metric (for positive bars) or number of listings (negative bars) as of the date represented by the red line. This date can be adjusted using the range slider to see the general pricing trend of all shows over time. Hovering on a bar shows the historic pricing for that specific show. 

A line chart showing the median price for the concerts at East Rutherford, NJ. The median price drops from $5,269 on Thursday May 25 to $2,978 on Friday, May 26.
Historic median price for the Sun. May 28 show at MetLife Stadium in New Jersey, showing a 43% decrease in median price the week before the concert.

Play with the live viz here

So what did I divine from all this work? Generally, I noticed that prices increased in the weeks before a show, with a spike occurring in the middle of the week leading up to it. Then, however, something interesting happened: on the Friday before a show, prices tended to drop dramatically. Combing through Swiftie Facebook groups and Twitter accounts, I realized that this was caused by tickets that Ticketmaster was releasing the very weekend of the concerts. Unsurprisingly, many of these tickets were then immediately posted for resale on SeatGeek, thus increasing supply and decreasing the price of tickets. Since buying the face-value tickets released by Ticketmaster would be near impossible (though of course I’d try), this would have to be my purchase window–at the eleventh hour, the very weekend of the shows. Though waiting till the last minute seemed risky (and oh, how that wait gave me a few extra grays), I decided to trust my visualization.

When Ticketmaster released additional tickets two days before the concert, I bought a resale ticket on SeatGeek for the May 28th MetLife show. The ticket—eye-watering transaction fee included—was not cheap. And by ‘not cheap,’ I mean it was expensive—as in, ‘a month’s rent in New York’ expensive, or, for the more mathematically inclined, ‘add an extra zero to the original price’ expensive. As my adrenaline waned, a sobering reality set in. What had I just done? Had I really spent all that money in one fell swoop? And what if the concert turned out to be just like any other? What if it failed to meet my impossible expectations?

All weekend long, I questioned my decision, sick with both buyer’s remorse and that hopeful malady known as excitement.

Buyer’s remorse? Shake it off

But to say that the show didn’t disappoint is a vast understatement. It was the best night of my life (yes, humbly dethroning my previous best night, at her Reputation concert). Her singing was flawless, her performance intimate. The stage sets were immersive and grand, the lighting mesmerizing and psychedelic. From nosebleed seats, a normally disappointing bird’s eye view was transformed into a unique perspective of coordinated visual effects. In ‘Mastermind,’ she sings, ‘Checkmate, I couldn’t lose,’ and at one point a shifting chessboard was projected onto the stage floor, with dancers standing in for chess pieces—a sight unavailable to the fortunate few with floor seats. Everything—the lights, dancers, and sets—was coordinated in a manner that transcended a normal concert and approached something closer to a Broadway show or, as a devout Swiftie might say, a religious experience.

Image of Taylor Swift on stage with a giant screen behind her showing her singing into a microphone.
Author’s image, May 28 at MetLife Stadium.

As she traversed the eras of her career, so I traversed the eras of my life. When she sang about losing her grandmother in ‘marjorie,’ I looked up at the sky and fought back tears thinking of my Grandpa. When she sang ‘Shake It Off,’ I recalled belting out the very same lyrics with my college roommate as we commiserated over stupid boys. When she crooned about the first fall of snow in ‘All Too Well,’ I suddenly remembered leaving a college party late one night and being struck by the sight of snow falling fresh in New York City. I remembered how the streetlights had glowed with an aura of snowflakes; how I had listened with uncanny amazement to the unusual silence; and how, upon seeing the magical sight, I shared a moment of truce with a guy I was on the rocks with.

And here’s the thing: I wasn’t the only one having this experience. It was as if she touched every person in that stadium of 80,000. When she sang ‘betty,’ a recent song from folklore, I was shocked by the teenage girls around me who shouted along. The song, about a high school love triangle, was one where, despite loving the music, I’d found the lyrics a bit immature. But now I realized that Taylor, while maturing in her musical themes, still made an effort to connect with a younger audience, much in the same way that ‘The Story of Us’ had connected with me a decade earlier. And hearing it again, ‘betty’ became clever in a way that her earlier songs weren’t, incorporating intentional storytelling that deviates from her usual autobiographical style. (My turn to scream came a bit later, during the most recent era of her life, when the lyrical themes shifted from young love and heartbreak to the competing obligations of a career, relationships, and societal expectations.)

But the most touching moment of the concert occurred when, in the surprise acoustic section, Taylor sang ‘Welcome to New York,’ a synth-pop anthem from her 1989 album. At home, I’d normally skip this song, finding its beat a little too relentless. But hearing it intimately stripped down to her voice and her strummed guitar chords, I realized that my journey to standing in that stadium began much more than a few months ago.

I moved, not just to New York City, but to America 10 years ago. It was as far removed from a small Caribbean island as it was possible to be, and I distinctly remember the initial feeling of panic. Through the ups and downs, I made this my second home. And those ups and downs proudly mark the NYC era of my life. I made best friends and met the love of my life while surviving the stress of my undergraduate engineering degree. I struggled through multiple job hunts and a career pivot, but now get to do what I love every day (moving through appropriate design-development iterations of course!). It was a new soundtrack and I did dance to this beat – still do. 

So after this experience, I can see why Eras tour prices have only kept increasing over time… I may or may not be updating my visualization to keep an eye on future ticket prices…

Editing support: Rob Aldana

What ChatGPT (and Humans) Say About Data Science Trends https://nightingaledvs.com/chatgpt-data-science-trends/ Thu, 09 Mar 2023 14:29:55 +0000 https://dvsnightingstg.wpenginepowered.com/?p=16157 Milán Janosov asked ChatGPT about data science. Then, he analyzed Twitter to see what humans think about the topic. Here's how they compare.

The post What ChatGPT (and Humans) Say About Data Science Trends appeared first on Nightingale.

]]>
“What will be the biggest data science trends in 2023?”

First of all, why would I ask ChatGPT, an artificial intelligence tool, a question about data science? Well, there are several personal reasons. Since my undergraduate years of studying physics, I have been deeply fond of Isaac Asimov and his Foundation series. Later, as I did my PhD in data science, I realized how close data science and Asimov’s psychohistory actually are—using quantitative tools to understand and forecast human behavior at scale!

This fascination led to a language analytics research project in 2021, in which I treated Asimov’s books as a stream of textual data. As Asimov also contributed significantly to the principles of robotics, it felt like an appropriate gesture to incorporate a fancy new tool called “GPT-3” by OpenAI—a text completion engine. GPT-3 was already trained on a vast amount of human-written text and performed reasonably well at completing entire paragraphs based on as little as a single sentence. So I added a few paragraphs of GPT-3’s text to the research paper. Those paragraphs worked fairly well in the text but, honestly, did not add much novelty.

A few months later, as the new year of 2022 kicked in, I went to the OpenAI platform again and asked GPT-3 about the major data science trends for the new year. On the one hand, these prediction sentences worked quite well. On the other hand, they were very vague and generic, closer to horoscopes than actual forecasts. For example, the points were outlined in brief sentences rather than elaborated paragraphs, and they were generic enough that some were even slightly inconsistent with each other.

So after all this, I was very curious to see the new fuss around the latest tool of OpenAI, ChatGPT. At first glance, this chatbot really knows something! Passing medical and law school final exams, generating pretty color palettes, and so many more applications. In early 2023, I followed my tradition and asked the AI bot the big question again:

What will be the biggest data science trends in 2023?

1. Automation and AI-driven analytics: Automation and AI-driven analytics will continue to play a major role in data science trends in 2023. This will allow organizations to process and analyze large amounts of data quickly and accurately, enabling them to make better decisions and increase their efficiency.   

2. Augmented analytics: Augmented analytics is the combination of machine learning and natural language processing to help data scientists visualize data quickly and effectively. This technology will allow data scientists to quickly identify and explore patterns and trends in data, enabling them to make faster and more informed decisions.   

3. Explainable AI: Explainable AI is the ability to explain the results of a data science model to non-technical audiences. This technology will help bridge the gap between data scientists and non-technical decision makers, allowing them to understand the results of data science models and make better decisions.   

4. Edge computing: Edge computing will become increasingly important in 2023, as more and more devices will be connected to the internet and data will need to be processed in real-time. This will allow data scientists to quickly process and analyze data from devices at the edge of the network.   

5. Data privacy and security: As more and more data is collected, stored and analyzed, the importance of data privacy and security will become increasingly important. Data scientists will need to ensure that data is secure and that privacy laws are adhered to.

While at this point it’s pretty hard to judge how accurate ChatGPT’s data science predictions are, they seem to cover a lot more technical terms and are a lot more elaborate than the previous year’s forecasts from GPT-3. This makes a large technological upgrade pretty clear.

That said, comparing GPT-3 and ChatGPT isn’t apples to apples. ChatGPT comes from the same line as GPT-3; both belong to the family of so-called Generative Pre-trained Transformer (GPT) language models, a deep learning framework designed to produce human-like text. ChatGPT builds on GPT-3.5: where GPT-3 is a general-purpose system that excels at various tasks (from text generation to machine translation), ChatGPT was explicitly designed to be a chatbot, that is, chatty, able to hold up a conversation, and answering in a way that feels more natural to us.
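For context, here is a minimal sketch of what querying the two systems looks like with the pre-1.0 openai Python SDK that was current when this piece was written; the API key is a placeholder and the model names and parameters are illustrative rather than a record of the exact calls used.

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

QUESTION = "What will be the biggest data science trends in 2023?"

# GPT-3 style: a plain text-completion endpoint that simply continues the prompt
completion = openai.Completion.create(
    model="text-davinci-003",   # illustrative completion model
    prompt=QUESTION,
    max_tokens=300,
)
print(completion["choices"][0]["text"])

# ChatGPT style: a chat endpoint built around a message history,
# which is what makes it feel conversational
chat = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": QUESTION}],
)
print(chat["choices"][0]["message"]["content"])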

While the chatbot offers only a few points regarding the future of data science, human knowledge is out there on numerous channels even today, like Twitter. So in addition to asking ChatGPT what’s on the horizon for data science, I sampled what humans are saying about the topic by analyzing thousands of tweets and hashtags related to data science. Then, I visualized them on a network map, shown below.

An image of colorful network nodes on a black background, illustrated based on 10,000 tweets containing the hashtag #datascience over the last two weeks of 2022.
The hashtag network I designed by collecting and processing approximately 10,000 tweets containing the hashtag #datascience during the last two weeks of 2022. After downloading the tweets with TweePy, I extracted the hashtags from each tweet. Then I built the hashtag network where each node represents a hashtag, and two nodes are connected if they were co-tweeted. To keep the backbone of the network with the most important nodes and links, I also applied a final edge filtering step. Additionally, I coloured the nodes based on network communities, also known as strongly interconnected subgraphs.
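As a rough sketch of how such a hashtag network can be assembled, the Python snippet below assumes the tweets have already been collected (for example with TweePy) into a list of per-tweet hashtag lists; the edge-weight threshold and the greedy-modularity community detection are stand-ins for whatever filtering and community algorithm were actually used.

from itertools import combinations
from collections import Counter
import networkx as nx

def build_hashtag_network(tweet_hashtags, min_weight=5):
    # tweet_hashtags: one list of lowercased hashtags per collected tweet (assumed input)
    co_counts = Counter()
    for tags in tweet_hashtags:
        for a, b in combinations(sorted(set(tags)), 2):
            co_counts[(a, b)] += 1   # two hashtags are linked if they were co-tweeted

    G = nx.Graph()
    for (a, b), w in co_counts.items():
        if w >= min_weight:          # simple edge filter to keep the network backbone
            G.add_edge(a, b, weight=w)

    # colour nodes by network communities (strongly interconnected subgraphs)
    communities = nx.algorithms.community.greedy_modularity_communities(G, weight="weight")
    for i, comm in enumerate(communities):
        for node in comm:
            G.nodes[node]["community"] = i
    return G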

This data science snapshot shows that big data analytics still rules the world, with AI and ML in the center. (The figure also tells us that the data collection overlapped with 2022’s #100daysofcode.) What’s very interesting to see is that ChatGPT proposed pretty important matters, such as explainable AI and data privacy, which are nowhere among the major topics (network nodes), except maybe cybersecurity.

ChatGPT might hint at something on the network map as it foresees the rise of augmented analytics, but it didn’t mention blockchain, which was a moderately buzzy hashtag on Twitter.

Of course, the real question is still whether ChatGPT is producing smart combinations of existing pieces of information or the machine has inferred something we humans haven’t even thought of. For that, we’ll just have to wait and see.

FIFA World Cup 2022 – The Network Edition https://nightingaledvs.com/fifa-world-cup-2022-the-network-edition/ Fri, 23 Dec 2022 14:00:00 +0000 https://dvsnightingstg.wpenginepowered.com/?p=14626 After a long qualifying process packed with surprises (Italy missing out as the reigning European champions) and last minute drama (both Egypt and Peru missed..

The post FIFA World Cup 2022 – The Network Edition appeared first on Nightingale.

]]>
After a long qualifying process packed with surprises (Italy missing out as the reigning European champions) and last-minute drama (both Egypt and Peru missed out on penalties), the FIFA World Cup 2022 kicked off on the 20th of November in Qatar. With 32 countries and over 800 players representing nearly 300 clubs globally, the players’ combined current market value was estimated at more than EUR 12 billion. In this short piece, we explore what the small and interconnected world of football stars looks like.

Data

We are data scientists with a seasoned football expert on board, so we went for one of the most obvious choices of the field – www.transfermarkt.com. We first wrote a few lines of Python code to scrape the list of participating teams, each team’s players, and those players’ detailed club-level transfer histories. This is how we arrived at the impressive stats of our intro: the complete transfer history of 800 players, comprising roughly 6,600 transfers, with the earliest events dating back to 1995.
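To give an idea of what those “few lines of Python” might look like, here is a minimal sketch using requests and BeautifulSoup. The URL pattern, the browser-like User-Agent header, and the CSS selector are assumptions about the page structure, not a documented transfermarkt API.

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0"}  # transfermarkt tends to reject default client user agents

def get_player_links(squad_url):
    # squad_url is a club's squad page, e.g. something of the form
    # "https://www.transfermarkt.com/<club>/kader/verein/<id>" (illustrative assumption)
    html = requests.get(squad_url, headers=HEADERS, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # assumed selector: player names in the squad table sit in cells with class "hauptlink"
    return [a["href"]
            for td in soup.select("td.hauptlink")
            for a in td.find_all("a", href=True)]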

Club network

The majority of players came from the top five leagues (England, Spain, Italy, Germany, and France) and represented household teams such as Barcelona (with 17 players), Bayern Munich (16), or Manchester City (16). While that was no surprise, one of the many wonders of a World Cup is that players from all around the globe can show their talents. Though not as famous as the ‘big clubs’, Qatari Al Sadd gave 15 players, more than the likes of Real Madrid or Paris Saint-Germain! There are, however, great imbalances when throwing these players’ market values and transfer fees into the mix. To outline these, we decided to visualize the typical ‘migration’ path football players follow – what are the most likely career steps they make one after the other? 

A good (and referenceable) way to capture this, following the prestige analysis of art institutions, is to introduce network science and build a network of football clubs. In this network, every node corresponds to a club, while the network connections encode various relationships between them. These relationships may encode the interplay of different properties of clubs, and looking at the exchange of players (and cash) seems a natural choice. In other words, the directed transfers of players between clubs tie the clubs into a hidden network. Because the network is directed, it also encodes information about the typical pathways of players via the ‘from’ and ‘to’ directions, which eventually capture the different roles of clubs as attractors and sinks.

To do this in practice, our unit of measure is the individual transfer history of each player, shown in Table 1 for the famous Brazilian player known simply as Neymar. This table visualizes his career trajectory in a datafied format, attaching dates and market values to each occasion he changed teams. His career path looks clean from a data perspective, although football fans will remember that it was anything but – his fee of EUR 222M from Barcelona to PSG still holds the transfer record to this day. These career steps, quantified by the transfers, encode upgrades in the case of Neymar. In less fortunate situations, these prices can go down, signaling a downgrade in a player’s career.

Table 1. The datafied transfer history of Neymar.

Following this logic in our analysis, we assumed that two clubs, A and B (the old and new teams of a player), were linked if a player was transferred between them, and that the strength of this link corresponded to the total amount of cash associated with the transaction. The more transactions the two clubs had, the stronger their direct connection was (which can go both ways), with a weight equal to the total sum of transfers (in each direction). In the case of Neymar, this definition resulted in a direct network link pointing from Barcelona to Paris SG with a total value of EUR 222M paid for the left winger.

Next, we processed the more than six thousand transfers of the 800+ players and arrived at the network of teams shown in Figure 1. To design the final network, we went for the core of big money transactions and only kept network links that represented transfer deals worth more than EUR 2.5M in total. This network shows about 80 clubs and 160 migration channels of transfers. To accurately represent the two aspects of transfers (spending and earning) we created two versions of the same network. The first version measures node sizes as the total money invested in new players (dubbed as spenders), while the second version scales nodes as the total money acquired by selling players (dubbed as mentors).

Figure 1. The network of the top football clubs based on the total amount of money spent and received on player transfers. Node sizes correspond to these values, while node coloring shows the dominant color of each club’s home country flag.
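A compact way to express this construction in Python with networkx is sketched below. The transfers list is the assumed output of the scraping step, one record per transfer, and the EUR 2.5M threshold mirrors the filtering described above; the two node-size views (spenders and mentors) fall out of the weighted in- and out-degrees.

import networkx as nx

def build_club_network(transfers, min_total_fee=2_500_000):
    # transfers: assumed list of dicts like
    # {"from": "FC Barcelona", "to": "Paris SG", "fee": 222_000_000}
    G = nx.DiGraph()
    for t in transfers:
        if G.has_edge(t["from"], t["to"]):
            G[t["from"]][t["to"]]["weight"] += t["fee"]
        else:
            G.add_edge(t["from"], t["to"], weight=t["fee"])

    # keep only the big-money backbone
    weak = [(u, v) for u, v, w in G.edges(data="weight") if w < min_total_fee]
    G.remove_edges_from(weak)
    G.remove_nodes_from(list(nx.isolates(G)))

    spenders = dict(G.in_degree(weight="weight"))   # total fees paid for incoming players
    mentors = dict(G.out_degree(weight="weight"))   # total fees received for outgoing players
    return G, spenders, mentors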

Spenders

The first network shows us which clubs spent the most on players competing in the World Cup, with the node sizes corresponding to the total money spent. You can see the usual suspects: PSG, the two Manchester clubs, United and City, and the Spanish giants, Barcelona and Real Madrid. Following closely behind are Chelsea, Juventus, and Liverpool. It’s interesting to see Arsenal, who – under Arteta’s management – can finally spend on players, and Bayern Munich, who spend a lot of money but also make sure to snatch up free agents as much as possible.

Explore these relationships and the network in more detail by looking at Real Madrid! Los Blancos, as they’re called, have multiple strong connections. Their relationship with Tottenham is entirely down to two players who played an integral part in Real Madrid’s incredible three-year winning spell in the Champions League between 2016 and 2018: Croatian Luka Modric cost 35M, and Welsh Gareth Bale cost an at-the-time record-breaking 101M. While Real Madrid paid Man Utd 94M for Cristiano Ronaldo in 2009, in recent years there was a turn in the money flow, and United paid a combined 186M for three players: Ángel Di María, Raphael Varane, and Casemiro. They also managed to sell Cristiano Ronaldo at a profit to Juventus for 117M.

One can see other strong connections as well, such as Paris SG paying a fortune to Barcelona for Neymar and Monaco for Kylian Mbappé. There are also a few typical paths players take – Borussia Dortmund to Bayern Munich, Atlético Madrid to Barcelona, or vice versa. It’s also interesting to see how many different edges connect to these giants. Man City has been doing business worth over EUR 1M with 27 different clubs.

Mentors

The second network shows which clubs grow talent instead of buying them and have received a substantial amount of money in return. Node sizes represent the amount of transfer fees received. This paints a very different picture from our first network except for one huge similarity: Real Madrid. In the past, they were considered the biggest spenders. They have since adopted a more business-focused strategy and managed to sell players for high fees as mentioned above.

A striking difference, however, is that while the top spenders were all part of the top five leagues, the largest talent pools came from outside this cohort, except for Monaco. Benfica, Sporting, and FC Porto from Portugal, and Ajax from the Netherlands are all famous for their young home-grown talents, and are used as a stepping stone for players from other continents. Ajax has sold players who competed in this World Cup for over EUR 560M. Their highest received transfer fees include 85.5M for Matthijs de Ligt from Juventus and 86M for Frenkie de Jong from Barcelona. Ajax signed de Jong for a total of EUR 1 from Willem II in 2015 when he was 18, and de Ligt grew up in Ajax’s famous academy. Not to mention that they recently sold Brazilian Antony to Manchester United for a record fee of 95M. They paid 15.75M for him just two years ago – that’s almost 80M in profit. Insane!

Benfica earned close to 500M, most recently selling Uruguayan Darwin Nunez for 80M to Liverpool. The record fee they received is a staggering 127M from Atlético Madrid for Portuguese Joao Félix, who grew up at Benfica. Monaco earned 440M from selling players such as Kylian Mbappé (180M), Aurélien Tchouameni (80M), Portuguese Bernardo Silva (50M), and Brazilian Fabinho and Belgian Youri Tielemans (both for 45M). These clubs have become incredible talent pools for the bigger clubs, which makes them really appealing to young players. It’s interesting to see how many edges the nodes for these clubs have, further proving that these teams function as a means for reaching that next level.

Player network

After looking at the club-to-club relationships, let’s zoom in on the network of players binding these top clubs together. Here, we built on the players’ transfer histories again and reconstructed their career timelines. Then we compared these timelines for each pair of World Cup players, noted whether they ever played for the same team, and if so, how many years of overlap they had (if any).

To our biggest surprise, we got a rather intertwined network of 830 players connected by about 6,400 former and current teammate relationships, as shown in Figure 2. Additionally, the so-called average path length turned out to be 3 – which means if we pick two players at random, they most likely both have teammates who played together at some point. Node sizes were determined by a player’s current market value, and clusters were colored by the league’s nation where these players play.

Figure 2. The player-level network showing previous and current teammate relationships. Node size corresponds to the players’ current market values, while color encodes their nationality based on their country’s flag’s primary color. See the interactive version of this network here.
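The sketch below shows one way to build this teammate network in Python with networkx, assuming the career timelines have already been reconstructed as (club, start_year, end_year) stints per player. The average path length is computed on the largest connected component, since the metric is undefined across disconnected parts of the graph.

from itertools import combinations
import networkx as nx

def build_teammate_network(careers):
    # careers: assumed dict mapping player -> list of (club, start_year, end_year) stints
    G = nx.Graph()
    G.add_nodes_from(careers)
    for p1, p2 in combinations(careers, 2):
        overlap_years = 0
        shared_spell = False
        for club1, s1, e1 in careers[p1]:
            for club2, s2, e2 in careers[p2]:
                if club1 == club2 and s1 <= e2 and s2 <= e1:   # same club, overlapping spells
                    shared_spell = True
                    overlap_years += max(0, min(e1, e2) - max(s1, s2))
        if shared_spell:
            G.add_edge(p1, p2, years=overlap_years)
    return G

def average_path_length(G):
    # measured on the largest connected component only
    giant = G.subgraph(max(nx.connected_components(G), key=len))
    return nx.average_shortest_path_length(giant)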

It didn’t come as a surprise that current teammates would be closer to each other in our network. You can see some interesting clusters here, with Real Madrid, Barcelona, PSG, and Bayern Munich dominating the lower part of the network and making up its center of gravity. Why is that? The most valuable player of the World Cup was Kylian Mbappé, with a market value of 160M, surrounded by his PSG teammates like Brazilians Marquinhos and Neymar and Argentinian Lionel Messi. Messi played in Barcelona until 2021, with both Neymar and Ousmane Dembélé connecting the two clusters strongly. Kingsley Coman joined Bayern Munich in 2017, but he played for PSG up until 2014, where he was a teammate of Marquinhos, thus connecting the two clusters.

You can discover more interesting patterns in this graph, such as how the majority of the most valuable players have played together directly or indirectly. You can also see Englishmen Trent Alexander-Arnold (Liverpool) and Declan Rice (West Ham United) further away from the others; both players have only ever played for their childhood clubs. But the tight interconnectedness of this network is also evident in how close Alexander-Arnold actually is to Kylian Mbappé. During the 2017–2018 season, Mbappé played at Monaco with Fabinho behind him in midfield, who signed for Liverpool at the end of the season, making him and Alexander-Arnold teammates.

With the World Cup hosting players from hundreds of clubs and various nations, there are obviously some clusters that won’t connect to these bigger groups. Many nations have players who have only played in their home league, such as this World Cup’s host nation Qatar (maroon cluster in the top left corner). Saudi Arabia (green cluster next to Qatar) beat Argentina, causing one of this year’s biggest surprises. Morocco (red cluster in the top right corner) delivered the best-ever performance by an African nation in the history of the World Cup. Both of those nations join Qatar in this category of home-grown talent. These players will only show connections if they play in the same team – in the case of the Moroccan cluster, that team is Wydad Casablanca. The Hungarian first league’s only representative at the World Cup, Tunisian Aissa Laidouni of Ferencváros, hasn’t played at club level with anyone else who made it to the World Cup, so he became a lone node in our network. That shouldn’t be the case for long, considering how well he played in the group stages.

Conclusion

In conclusion, we saw in our analysis how network science and visualization can uncover and quantify things that experts may have a gut feeling about but lack the hard data for. The depth of understanding of internal and team dynamics that is possible through network science can also be critical in designing successful and stable teams and partnerships. Moreover, this understanding can lead to directly applicable insights on transfer and drafting strategies, or even spotting and predicting top talent at an early stage. While this example is about soccer, you could very much adapt these methods and principles to other collaborative domains that require complex teamwork and problem-solving with well-defined goals, from creative production to IT product management.

Behind the Scenes of “The Future of Data Science”: An Interview with Ciera Martinez https://nightingaledvs.com/behind-the-scenes-of-the-future-of-data-science-an-interview-with-ciera-martinez/ Wed, 26 Oct 2022 13:00:00 +0000 https://dvsnightingstg.wpenginepowered.com/?p=13547 A few months ago I had the pleasure of sitting down with Ciera Martinez to discuss the founding of the project, Data Science by Design,..

The post Behind the Scenes of “The Future of Data Science”: An Interview with Ciera Martinez appeared first on Nightingale.

]]>
A few months ago I had the pleasure of sitting down with Ciera Martinez to discuss the founding of the project, Data Science by Design, and what they’ve been up to recently. We discussed the process of creating the anthology as well as the role data science plays in society today. 

1. How did the concept behind Data Science x Design come about?

DSxD really began from sending fun Slack messages. In 2020, Sara Stoudt, Valeri Vasquez, and I were working together at the Berkeley Institute for Data Science as researchers. There, we had endless conversations about why we love data work, and a lot of the reasons were based on creativity, design thinking, using data as a tool to do good, and connecting with people. We were constantly sending each other Slack messages with links to zines, inspiring data work, and people we admire. What we shared was quite different from the academic and industry view of data science — efficiency, automation, data as oil, and so on — which unfortunately is how most people see the field. We saw the creative side of data science and observed this in the zeitgeist of data topics on the internet, like Twitter. We just wanted to collect and amplify all these voices and to build and create things with them. To elevate the less jargon-y, less academic side of data science. We were fortunate enough to get two grants that supported this vision. We reached out to other like-minded people to form what eventually became the leadership team (Sara Stoudt, Valeri Vasquez, Tim Schoof, Lauren Renaud, Natalie O’Shea and I). This all led to the creation of the book, our mini-grant program, and events and gatherings. So now, DSxD is largely shaped by the people who gravitate towards it. People interested in DSxD and the anthologies come from a variety of backgrounds, but I think an underlying thread is that most consider themselves part of multiple disciplines – artist AND researcher, or designer AND scientist.

2. What is your goal/mission as an organization and for the anthology in particular?

We celebrate the fundamental creativity of data science. We support those who leverage creative mediums, design thinking, and storytelling to convey the practice and insights of data science. We also aim to establish a community dedicated to developing a more open, ethical, and inclusive future for the field.

Our aim is to re-brand data science, so ultimately it attracts more diversity in the types of people who work with data. Data science should be thought about more broadly, bringing society, art, and process into how we view data work.

3. What was the overall process for creating the anthology?

It is outlined pretty well in this infographic we created below. We hold events to get feedback, connect, and inspire. Then the leadership team functions like a yearbook committee putting the book together. We also have mini grants to help people develop their ideas and execute their projects. From there, we get illustrators and designers involved to beautify what everyone submits.

4. Who was involved in creating the anthology? How did you get such a great collection of people together?

At the first event, Creator Conf, the vision for the anthology was refined through discussions at the conference sessions. The people who joined us really shaped the scope of the anthology, from the type of subject matter we were looking for to the type of people to invite to contribute.

As for the contributors, it was a mix of people from the DSxD community and us reaching out to people we were all fan-girling about. We strove to collect unique voices around how data is used and spent the rest of the time elevating and beautifying their vision.

5. What gaps does this anthology address in the conversations happening around data science?

We strive to highlight the design process involved in data work. While aesthetic design principles are one aspect of data work, we explore other design principles and guidelines, like those concerning ethics and function. An underlying concept of everything we work on is “show your work.” We expect transparency with the people behind the data work, not only because there are ethical concerns in hiding the people behind data work (see Data Feminism chapters “Numbers don’t speak for themselves” and “Show your work“), but also because exposing the steps that lead up to the final results is always interesting and inspiring, and many times beautiful.

6. In “Writing a modelers MaNifesto,” the author says, “what I’ve realized is that modeling is also political.” How, if at all, is data science as a field political and why is that something we should keep in mind?

This is an ongoing conversation within our community that strives to articulate this for ourselves and within our work. In our book club, we discuss this regularly in the context of defining what data is in our society. Data is ever present and we are unable to opt out.

Therefore, the omnipresence of data in everyone’s lives forces data science to be political. Where there are people, there is politics.

The issues that society has are reflected in how we handle data and our society. Data is always an abstraction, highly filtered through the people who collect and analyze it. Being a data practitioner requires you to delve into power dynamics in society, because understanding the intricacies of people’s roles in society makes you a better data scientist. Understanding how the data has been touched by people and where your data is coming from greatly informs the quality of your data work and helps you find more accurate patterns.

7. Word around town is that you are all working on another book; what can you tell us about that?

Yes!!! We are ramping up for another cycle and we are so excited. The next theme is “Our Environment,” which is a way for us to delve into understanding the many worlds we occupy and how data fits into these worlds, from our natural world to virtual reality. We are now accepting submissions!


You can buy Volume 1, The Future of Data Science, here.
