
Cheat Sheet

Author @Saief1999

0. Introduction

Goal of this course : raising awareness about the issues that we (as users of the digital world) are exposed to.

1. Ethical frameworks

Reference : What are Ethical Frameworks

Ethical frameworks are perspectives useful for reasoning about which course of action may produce the most moral outcome.

In many cases, a person may not use a reasoning process but rather do what they simply feel is best at the time. Others may reflexively use a principle they learned from their family, peers, religious teachings or own experiences.

The study of ethics has provided many principles that can aid in ethical decision making. Some of the most common are captured in the following 5 ethical frameworks:

a. Virtue theory : Aristotle and Virtue Theory

-> What is moral is what makes us the best person we could be.

Ethical theory that emphasizes an individual's character rather than following a set of rules.

If we can just focus on being good people, the right actions will follow, effortlessly.

Aristotle argues that nature has built into us the desire to be virtuous (good people).

Virtue : The golden mean in our attitudes (not too much, not too little).

e.g. : Courage is the golden mean between recklessness and cowardice.

Aristotle believes there are moral exemplars that we generally tend to follow because we believe in them.

Eudaimonia : A life of striving, pushing yourself to your limits and finding success. Being the best person you could be. For Aristotle, that's the kind of person that will do good things.

b. Kantianism : Kant and categorical imperatives

Also called Deontology

-> What is moral is what follows from absolute moral duties.

In order to determine what's right, we need to use reason.

Hypothetical imperatives : Commands we should follow if we want something. ( e.g. if we want to pass the exam, we should study )

Categorical imperatives : Commands you must follow, regardless of your desires. Moral obligations derived from pure reason ( binding on all of us ).

Kant : It's not fair for you to make exceptions for yourself. Follow absolute moral rules instead ( even when you think the rule should not apply to your situation -> e.g. : lying to save someone's life ).

c. Utilitarianism

-> What is morally right is what generates the best outcome for the largest number of people.

Moral theory that focuses on the results or consequences of our actions, and treats intentions as irrelevant.

Good consequences = Good actions

Actions should be measured in terms of the happiness, or pleasure, that they produce.

This moral theory contradicts kantianism.

Rule Utilitarianism : Version of the theory that says we ought to live by rules that, in general, are likely to lead to the greatest good for the greatest number. ( Look at the outcome in the long run, not the short run. e.g. Don't harvest people for their organs even if it could save many others in the short run; in the long run we would all have to live in constant fear... )

d. Contractarianism : Hobbes and contractarianism

Also called Rights-based Ethics

-> What is moral is that which is in accord with everyone's rights.

Moral theory that believes right acts are those that do not violate the free, rational agreements that we've made.

No rules -> A lot of freedom, but no security ( other people might do things that are not in accord with your rights ). ( A terrible way to live... )

Free, rational, self-interested people realize that there are more benefits to be found in cooperating than in not cooperating.

Defection : When you break the contract you're in - whether you agreed to be in it or not - and decide to look after your own interests instead of cooperating.

Morality is determined by a group of contractors : whatever they agree to goes.

No one should be forced into a contract. It should make your life overall better (even if, in some aspects, it favours other people)

If, as a group, we change our minds, we can simply modify the contract. Which is exactly what happens, explicitly, when we change laws, and implicitly, with shifting social mores.

e. Care-based ethics / Feminist ethics

-> What is moral is that which promotes healthy relationships and the well-being of individuals and their interdependence.

In this paradigm, women's nurturing relationships are taken as a model for care, where everyone has responsibilities and obligations to care for others.

Caring : everything done directly to help others

  • Meet their basic needs
  • Sustain their basic capabilities
  • Alleviate pain or suffering

This theory argues that other ethical frameworks are not universal and don't apply to everyone. ( e.g. women's moral development was found to be characterized much more by a morality of responsibility and care )

They argue that ethical frameworks are centered around reason, law and justice ( which were thought to be male values ) and not much on feminine values ( such as empathy, relatedness and responsiveness ).

Care based ethics : men and women differ in their decision making

Consequences :

  • We must be able to take different perspectives ( someone of a different gender, race, nationality, class... )
  • Self-reflection ( we are prone to biases )
  • Open to feedback and suggestions
  • Engage in critical conversations and debates

2. Privacy

Privacy is the condition of being concealed or hidden. All that is not hidden is therefore not private.

Neotas : a tool used for social media screening prior to hiring.

Why don't we care about privacy in Tunisia?

We have the habit of over-sharing sensitive information, to the point where we no longer realize how sensitive it is.

We live in a country that is not digitized enough, so even if institutions hold sensitive data, they either can't access it or have nothing to do with it.

A. Belmont Report 1978

The Belmont Report marks an important milestone in the history of clinical research. It established guidelines for basic ethical principles, as well as informed consent, the assessment of risks and benefits and subject selection.

Motivation

The Belmont Report was written in response to the infamous Tuskegee Syphilis Study, in which African Americans with syphilis were lied to and denied treatment for more than 40 years. Many people died as a result, infected others with the disease, and passed congenital syphilis onto their children.

Following the Tuskegee study, Congress passed the National Research Act, creating the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. This commission met regularly for nearly four years, culminating in a four-day discussion at the Smithsonian Institution’s Belmont Conference Center in February 1976.

The Belmont Report addresses informed consent as a necessary part of showing respect for all persons. It states that all subjects, to the degree that they are capable, should be given the opportunity to choose what shall or shall not happen to them.

According to the report, informed consent requires three elements: information, comprehension and voluntariness.

a. Information

Research subjects “must be given sufficient information about the research procedure, their purposes, risks and anticipated benefits and alternative procedures (where therapy is involved).” They should be given the opportunity to ask questions and have the right to withdraw from the research at any time.

In cases where informing subjects about some pertinent aspect of the research is likely to impair the validity of the research, the Belmont Report states withholding information is justified only if the following three criteria apply:

  1. Incomplete disclosure is truly necessary to accomplish the goals of the research
  2. There are no undisclosed risks to subjects that are more than minimal
  3. There is an adequate plan for debriefing subjects, when appropriate, and for disseminating research results to them

Researchers should never withhold information about risks for the purpose of getting a subject to cooperate.

b. Comprehension

The Belmont Report states that “the manner and context in which information is conveyed is as important as the information itself.” For instance, allowing too little time for the subject to consider the information could affect their ability to make an informed choice.

That means researchers need to consider a subject’s maturity, capacity for understanding, language and literacy when presenting information to obtain informed consent. In some cases, the report states, it may be appropriate to give oral or written tests of comprehension.

When a subject’s comprehension is severely limited due to age, disability or other factors, researchers need to seek the permission of other parties to protect them from harm.

c. Voluntariness

Informed consent means there is no coercion or undue influence. In other words, researchers cannot threaten harm or offer an “excessive, unwarranted, inappropriate or improper reward” to obtain compliance.

That means researchers need to take special care when conducting clinical trials involving vulnerable people who are under the authority of someone else, such as inmates or people who are ill.

The rise of the digital era, with its massive repositories of datasets free for the taking, has upended all of the ethical practices that had developed over the previous half century.

Once researchers were no longer collecting data themselves, but rather making use of data that someone else had collected for an entirely different purpose ( perhaps without the subject's knowledge, or even against the subject's explicit demands ), they argued that they shouldn't be held responsible for ethical data use anymore, since they were merely using someone else's data.

C. GDPR

-> Consent of the data subject means any freely given, specific, informed and unambiguous indication of the data subject’s wishes by which he or she, by a statement or by a clear affirmative action, signifies agreement to the processing of personal data relating to him or her.

  • freely given : given on a voluntary basis. The element “free” implies a real choice by the data subject. Any element of inappropriate pressure or influence which could affect the outcome of that choice renders the consent invalid.

    • For one thing, that means you cannot require consent to data processing as a condition of using the service. They need to be able to say no.
    • The one exception is if you need some piece of data from someone to provide them with your service.
  • Consent is specific. It should be clear what data processing activities you intend to carry out, granting the subject an opportunity to consent to each activity ( you must obtain consent for each purpose separately ).

  • Informed consent means the data subject knows your identity, what data processing activities you intend to conduct, the purpose of the data processing, and that they can withdraw their consent at any time.

  • The data subject must also be informed about his or her right to withdraw consent anytime. The withdrawal must be as easy as giving consent.

  • Consent must be unambiguous, which means it requires either a statement or a clear affirmative act. It cannot be implied and must always be given through an opt-in, a declaration or an active motion, so that there is no misunderstanding that the data subject has consented to the particular processing.

  • For those who are under the age of 16, there is an additional consent or authorisation requirement from the holder of parental responsibility.

Introduced in 2018 as part of the General Data Protection Regulation (GDPR) by the EU, these notices ask users to agree to being tracked when visiting a site for the first time.

The Cookie Law actually applies not only to cookies but, more broadly speaking, to any other type of technology that stores or accesses information on a user's device.

The cookie consent policy got updated in 2020 with these new changes :

  • Cookie walls are prohibited ( since they prevent users from using a website without consenting; the user is no longer presented with a real choice )
  • The EDPB (European Data Protection Board) does not consider consent via scrolling or continued browsing to be valid.

4. Transparency

A. Definition

Transparency :

  • The ability to easily access and work with data no matter where they are located or what application created them.
  • The assurance that data being reported are accurate and are coming from the official source.

B. Open Data

Data or content is open if anyone is free to use, re-use or redistribute it, subject at most to measures that preserve provenance and openness.

There are two dimensions of data openness :

  1. The data must be legally open, which means they must be placed in the public domain or under liberal terms of use with minimal restrictions.
  2. The data must be technically open, which means they must be published in electronic formats that are machine readable and non-proprietary, so that anyone can access and use the data using common, freely available software tools. Data must also be publicly available and accessible on a public server, without password or firewall restrictions. To make Open Data easier to find, most organizations create and manage Open Data catalogs.

C. Anonymizing data

Anonymizing data is very important to ensure that the privacy of users is guaranteed.

a. Data masking

-> Hiding data with altered values.

You can create a mirror version of a database and apply modification techniques such as character shuffling, encryption, and word or character substitution.

For example, you can replace a value character with a symbol such as “*” or “x”. Done well, data masking makes reverse engineering or detection very difficult.
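A minimal Python sketch of character substitution (the `mask` helper and the sample value are invented for illustration):

```python
# Data masking sketch: replace all but the last few characters of a
# sensitive value with "x". Helper name and sample value are hypothetical.
def mask(value: str, keep: int = 2) -> str:
    """Mask all but the last `keep` characters."""
    if len(value) <= keep:
        return "x" * len(value)
    return "x" * (len(value) - keep) + value[-keep:]

print(mask("4539148803436467"))  # xxxxxxxxxxxxxx67
```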

b. Pseudonymization

Can be achieved by substitution.

A data management and de-identification method that replaces private identifiers with fake identifiers, or pseudonyms.

For example replacing the identifier “John Smith” with “Mark Spencer”.

Pseudonymization preserves statistical accuracy and data integrity, allowing the modified data to be used for training, development, testing, and analytics while protecting data privacy.
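A rough sketch of pseudonymization by substitution (the class and naming scheme are hypothetical): the mapping from real names to pseudonyms is kept separate from the released data, and the same identifier always gets the same pseudonym, which is what preserves joins and statistics.

```python
# Pseudonymization sketch: real identifiers are replaced with consistent
# pseudonyms via a lookup table kept separate from the released data.
import itertools

class Pseudonymizer:
    def __init__(self):
        self._table = {}                    # real name -> pseudonym (kept secret)
        self._counter = itertools.count(1)

    def pseudonym(self, name: str) -> str:
        # The same real identifier always maps to the same pseudonym,
        # preserving joins and statistics across records.
        if name not in self._table:
            self._table[name] = f"Person-{next(self._counter):04d}"
        return self._table[name]

p = Pseudonymizer()
records = [{"name": "John Smith", "age": 41}, {"name": "John Smith", "age": 39}]
out = [{**r, "name": p.pseudonym(r["name"])} for r in records]
```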

c. Generalization

Deliberately removes some of the data to make it less identifiable. Data can be modified into a set of ranges or a broad area with appropriate boundaries. You can remove the house number in an address, but make sure you don’t remove the road name. The purpose is to eliminate some of the identifiers while retaining a measure of data accuracy.
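A small sketch of the two generalizations mentioned above (both helper functions are invented, and the address helper assumes a simple "number road-name" format):

```python
# Generalization sketch: exact values become ranges or broader areas.
def generalize_age(age: int, width: int = 10) -> str:
    """Replace an exact age with a 10-year band."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def generalize_address(address: str) -> str:
    """Drop a leading house number, keep the road name."""
    parts = address.split(" ", 1)
    if len(parts) == 2 and parts[0].isdigit():
        return parts[1]
    return address

print(generalize_age(37))                 # 30-39
print(generalize_address("42 Oak Road"))  # Oak Road
```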

d. Data swapping

Data swapping ( also known as shuffling or permutation ) is a technique used to rearrange dataset attribute values so they no longer correspond to the original records. Swapping attributes (columns) that contain identifier values such as date of birth, for example, may have more impact on anonymization than swapping membership-type values.
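The idea can be sketched in a few lines of Python (the records are made up): one column is shuffled on its own, so its values no longer line up with the original rows, while the column's overall distribution is preserved.

```python
# Data swapping sketch: permute a single identifying column.
import random

def swap_column(rows, column, seed=0):
    values = [r[column] for r in rows]
    random.Random(seed).shuffle(values)      # shuffle just this column
    return [{**r, column: v} for r, v in zip(rows, values)]

rows = [
    {"dob": "1990-01-01", "plan": "gold"},
    {"dob": "1985-06-15", "plan": "basic"},
    {"dob": "2001-12-31", "plan": "gold"},
]
swapped = swap_column(rows, "dob")
```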

e. Data perturbation & noise addition

Modifies the original dataset slightly by applying techniques that round numbers and add random noise. The perturbation needs to be in proportion to the range of values: a small base may lead to weak anonymization, while a large base can reduce the utility of the dataset.

For example, you can use a base of 5 for rounding values like age or house number, because it's proportional to the original values. A house number rounded with a base of 15 may still retain its credibility, but using a base of 15 for age values can make them seem fake.
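A tiny sketch of round-and-add-noise perturbation (the function and parameters are hypothetical; a base of 5 and noise of ±2 suit age-like values):

```python
# Perturbation sketch: round to a base, then add small random noise.
import random

def perturb(value: int, base: int = 5, noise: int = 2, seed: int = 0) -> int:
    rounded = base * round(value / base)              # e.g. 43 -> 45
    return rounded + random.Random(seed).randint(-noise, noise)

print(perturb(43))  # somewhere in 43..47
```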

f. Synthetic data

Algorithmically manufactured information that has no connection to real events. Synthetic data is used to create artificial datasets instead of altering the original dataset or using it as-is and risking privacy and security. The process involves creating statistical models based on patterns found in the original dataset. You can use standard deviations, medians, linear regression or other statistical techniques to generate the synthetic data.
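A minimal sketch of the statistical-model approach (the ages are made up, and a normal distribution is assumed purely for illustration): fit the mean and standard deviation of the original column, then sample artificial values from those parameters.

```python
# Synthetic data sketch: sample fake values from statistics fitted
# to the original column, so no real person's value is released.
import random
import statistics

def synthesize(original, n, seed=0):
    mu = statistics.mean(original)
    sigma = statistics.stdev(original)
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]

ages = [23, 35, 41, 29, 52, 47, 38]
fake_ages = synthesize(ages, n=100)  # similar distribution, no real people
```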

D. Importance of anonymizing data

Research has shown that with gender, date of birth, and zip code, you can easily identify most people (87% of people are identifiable from these three attributes alone).

Even with just two of these attributes, it is often possible to find a person's identity.
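A toy illustration of why this matters (all records below are invented): joining a supposedly anonymous table with a public record on those quasi-identifiers re-identifies people.

```python
# Re-identification sketch: an "anonymized" medical table joined with a
# public voter roll on gender + date of birth + zip code. All data is fake.
medical = [  # no names, supposedly anonymous
    {"gender": "F", "dob": "1990-03-14", "zip": "02138", "diagnosis": "flu"},
    {"gender": "M", "dob": "1990-03-14", "zip": "02139", "diagnosis": "asthma"},
]
voters = [  # public record that includes names
    {"name": "Alice Jones", "gender": "F", "dob": "1990-03-14", "zip": "02138"},
]

keys = ("gender", "dob", "zip")
reidentified = [
    {"name": v["name"], "diagnosis": m["diagnosis"]}
    for m in medical for v in voters
    if all(m[k] == v[k] for k in keys)
]
print(reidentified)  # Alice's diagnosis is exposed
```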

5. Bias

A. The trouble with Bias

1. What is Bias ?

  • Bias has different definitions which adds to the confusion
  • 1900’s definition: Systematic differences between a sample and a population
  • In law: Bias means judgment based on preconceived notions or prejudice ( a lack of impartiality )
  • Bias comes from:
    • In most cases, the data it was trained on ( training data can be incomplete, biased, skewed, non-representative, etc. )
    • Humans biases and cultural assumptions can cause bias by introducing exclusion or underrepresentation of certain populations
    • e.g. of the 4.4 million people stopped on suspicion between 2004 and 2012, 83% were Black or Hispanic
    • Social sciences and humanities have decades of research on bias in social systems

2. Harms of allocation

Allocation harm : When a system allocates or withholds an opportunity or a resource from certain groups ( e.g. who gets a mortgage? who gets a loan? )

Allocation harms are immediate, easily quantifiable, discrete, and transactional.

3. Harms of representations

Representation harm : Occurs when a system reinforces the subordination of some groups along the lines of identity ( race, class, gender... )

They occur regardless of whether resources are withheld from members of a protected class.

Representation harms are long-term, difficult to formalize, diffuse, and cultural.

e.g. Google Photos labeling Black people as “gorillas”

a. Stereotyping

Doctor / nurse

b. Recognition

Does a system recognize a face, a personality, ...?

e.g. : Asian people being recognized as blinking

c. Denigration

When people use culturally disparaging terms

Autocomplete : “jews should” -> first result -> “be wiped out”

d. Under representation

e.g. Searching for “CEO” on Google Images : all males ( and one Barbie... )

Dilemma : Do you show reality as it is or as you want it to be?

4. Politics of classification

  • Classification is always a product of its time ( it always depends on the era and its culture )
    • Sometimes classifications tend to "stick" around longer than they should ( becoming outdated and resulting in bias )
  • Machine learning classifications can lead to criminal prosecution, jailing, or worse

5. What can we do?

  1. We need to start working on fairness forensics : ways to test our systems ( building pre-release trials, tracking the lifecycle of a dataset to know who built it and what demographic skews are in that set )
  2. Start taking interdisciplinarity seriously ( working with people who are not in our field but who have expertise across other areas )
  3. Think harder about the ethics of classification ( where it should be / should not be done )

B. Algorithmic Bias and Fairness

Algorithmic bias : Biases in the real world get mimicked, or even exaggerated, by AI systems.

We need to know the difference between bias (which we all have) and discrimination (which we can prevent).

1. Data reflects existing biases

Nurse and programmer searches ( nurses are generally returned as women, and programmers as men )

Correlated features : attributes that are not explicitly in the data but can be inferred from it.

  • Zip code -> strongly correlated with -> race
  • Sexual orientation -> strongly correlated with -> characteristics of social media photos

2. Unbalanced classes in training data

Models are trained with data that is predominantly white, making it difficult for them to work well for other races too.

3. Data doesn't capture the right value

It's not easy for a computer to take into consideration all the different factors that we do.

e.g. : AI essay scoring for SAT and GRE tests ( randomly generated paragraphs get accepted as valid answers )

4. Data amplified by a feedback loop

Positive feedback loop : amplifying what happened in the past, whether or not this amplification is good.

e.g. A crime-prediction algorithm that ends up focusing on only one neighbourhood

e.g. Principal and school ( focus on slacking pupils, ignore the rest )
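A toy simulation of such a loop (all numbers and rates invented): two neighbourhoods have the same true crime rate, but patrols follow past records, and crime only gets recorded where police are looking, so an initial imbalance grows.

```python
# Feedback-loop sketch: patrols go where recorded crime is highest, and
# crime is only recorded where patrols are, so the imbalance reinforces.
import random

rng = random.Random(0)
recorded = {"A": 10, "B": 9}   # historical records, slightly unbalanced
TRUE_RATE = 0.3                # identical true crime rate in both areas

for day in range(200):
    patrolled = max(recorded, key=recorded.get)  # follow the "prediction"
    if rng.random() < TRUE_RATE:
        recorded[patrolled] += 1                 # only observed crime counts

print(recorded)  # "A" pulls far ahead despite identical true rates
```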

5. Malicious data attack or manipulation

e.g. Tay, an AI chatbot that was targeted by a coordinated attack and began outputting offensive language.

6. Conclusion

AI systems try to make good predictions, but they make mistakes:

  1. We need to understand that algorithms will be biased. It's important to be critical about AI recommendations instead of just accepting them.
  2. If we want less biased algorithms, we may need more training data on protected classes like race, gender, or age. Looking at an algorithm's recommendations for protected classes may be a good way to check it for discrimination.

C. Data validity

We are often limited by the data we have, and we don't always have access to the data we would like to have.

The big question is: what constitutes valid data that we can base our conclusions on?

Data validity issues :

  • Choice of the attributes or measures
  • Error in data processing
  • Error in model design
    • Simpson's paradox : a phenomenon in probability and statistics in which a trend appears in several groups of data but disappears or reverses when the groups are combined.
  • Campbell's law
    • the observation that once a metric has been identified as a primary indicator for success, its ability to accurately measure success tends to be compromised.
    • e.g. : a decrease in a city’s crime rate may not demonstrate a true reduction in the number of crimes that have been committed, but may simply reflect how the police force has changed procedures to lower the number. They may have decided, for example, to change which police encounters need to be formally recorded. They may also have downgraded some crimes to less serious classifications.
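Simpson's paradox is easiest to see with numbers. The sketch below uses the classic kidney-stone treatment figures: treatment A has the better success rate within each severity group, yet the worse rate when the groups are pooled.

```python
# Simpson's paradox illustration with the classic treatment figures.
cases = {
    # (treatment, group): (successes, trials)
    ("A", "mild"):   (81, 87),
    ("A", "severe"): (192, 263),
    ("B", "mild"):   (234, 270),
    ("B", "severe"): (55, 80),
}

def rate(treatment, group=None):
    picked = [(s, t) for (tr, g), (s, t) in cases.items()
              if tr == treatment and (group is None or g == group)]
    successes = sum(s for s, _ in picked)
    trials = sum(t for _, t in picked)
    return successes / trials

assert rate("A", "mild") > rate("B", "mild")      # A wins in each group...
assert rate("A", "severe") > rate("B", "severe")
assert rate("A") < rate("B")                      # ...but loses overall
```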

Bad analysis of good data:

  • Correlation mistaken for causation
  • Misleading results
  • p-hacking : a misuse of data analysis to find patterns in data that can be presented as statistically significant when in fact there is no real underlying effect
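A quick simulation of how p-hacking produces false positives (all numbers synthetic): test 100 variables with no real effect, and a few will still clear the 0.05 bar by chance, because under the null hypothesis p-values are uniform on [0, 1].

```python
# p-hacking sketch: many null tests yield "significant" hits by chance.
import random

rng = random.Random(42)
p_values = [rng.random() for _ in range(100)]     # 100 null experiments
significant = [p for p in p_values if p < 0.05]   # spurious "discoveries"

print(len(significant), "out of 100 look significant with no real effect")
```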