Expectation of Random Variable: E[X] = \sum_i i \cdot P_X(i)

Expectation converts a random variable (a function of the random outcome) into a constant.

\begin{align*}
E[X] =& \int_{i = -\infty}^\infty i f_X(i) di\\
=& \int_{i = 0}^\infty \overline{F_X}(i) di = \int_{i = 0}^\infty Pr\{X > i\} di \tag{for $X \geq 0$}\\
E[X^2] =& 2\int_{i = 0}^\infty i \overline{F_X}(i) di = 2\int_{i = 0}^\infty i Pr\{X > i\} di \tag{for $X \geq 0$}
\end{align*}
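
The discrete analogue of the tail-sum identity, E[X] = \sum_{i \geq 0} Pr\{X > i\} for a nonnegative integer random variable, can be checked directly; the uniform distribution below is just an illustrative choice:

```python
# Check E[X] = sum_{i>=0} Pr{X > i} for X uniform on {1, 2, 3, 4}.
pmf = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}

# Expectation directly from the p.m.f.
mean_from_pmf = sum(i * p for i, p in pmf.items())

# Expectation from the tail probabilities Pr{X > i} for i = 0, 1, 2, 3
# (the tail is 0 for i >= 4, so the sum can stop there).
mean_from_tail = sum(
    sum(p for j, p in pmf.items() if j > i) for i in range(4)
)

print(mean_from_pmf, mean_from_tail)  # both 2.5
```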

For any random variables X, Y:

E[X + Y] = E[X] + E[Y]

Linearity of Expectation (with indicator random variables), summing the tail of the c.d.f., summing the p.d.f., and Conditioning are 4 ways to start.

For random variables X, Y with X \perp Y:

E[X \cdot Y] = E[X] \cdot E[Y]

Proof:

\begin{align*}
&E[X \cdot Y]\\
=& \int \int xy f_{X, Y}(x, y) dx dy\\
=& \int \int xy f_X(x) \cdot f_Y(y) dx dy \tag{by independence}\\
=& \left(\int x f_X(x) dx\right) \cdot \left(\int y f_Y(y) dy\right)\\
=& E[X] \cdot E[Y]\\
\end{align*}

Let X \sim \text{Bernoulli}(p), then E[X] = p

Let X \sim \text{Geometric}(p), then

\begin{align*}
E[X] &= \sum_{i = 1}^\infty i \cdot (1-p)^{i-1}p\\
&= p\sum_{i = 1}^\infty i \cdot (1-p)^{i-1}\\
&= p(1 + 2(1-p)^1 + 3(1-p)^2 + ...)\\
&= \frac{p}{(1-(1-p))^2}\\
&= \frac{p}{p^2}\\
&= \frac{1}{p}\\
\end{align*}
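
A quick simulation check that the expectation comes out to 1/p; the sampling scheme below is a plain sketch (flip until the first success), with p = 0.3 as an arbitrary choice:

```python
import random

random.seed(0)
p = 0.3

def geometric(p):
    """Number of Bernoulli(p) trials until the first success (support 1, 2, ...)."""
    flips = 1
    while random.random() >= p:
        flips += 1
    return flips

samples = [geometric(p) for _ in range(100_000)]
estimate = sum(samples) / len(samples)
print(estimate)  # close to 1/p = 3.33...
```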

By Conditioning:

Let X = \text{number of flips to get head}

Lemma: the conditional random variable [X | X > 1] has the same distribution as [1+X]. Proof below.

\begin{align*}
E[X] &= E[X|\text{first flip is head}] \cdot p + E[X|\text{first flip is tail}] \cdot (1-p)\\
E[X] &= 1 \cdot p + E[1+X] \cdot (1-p) \tag{by Lemma}\\
E[X] &= p + (1 + E[X])(1-p)\\
E[X] &= p + (1-p) + E[X](1-p)\\
E[X] &= 1 + E[X](1-p)\\
E[X] \cdot p &= 1\\
E[X] &= \frac{1}{p}\\
\end{align*}

\begin{align*}
[X | X > s] =& [s + X]\\
Pr\{X = t | X > s\} =& Pr\{s + X = t\}\\
Pr\{X = t | X > s\} =& Pr\{X = t - s\}\\
Pr\{X = t + s | X > s\} =& Pr\{X = t - s + s\}\\
Pr\{X = t + s | X > s\} =& Pr\{X = t\}\\
\end{align*}

Proof: Let X \sim \text{Geometric}(p), Y = [X | \text{1st flip is a tail}] = [X | X > 1]

Now, Y is a different random variable with its own distribution. The support of Y (2, 3, 4, ...) is not the same as the support of X (1, 2, 3, 4, ...). Think of Y as X restricted to the range i = 2 to \infty and renormalized.

We claim: Y =^d 1 + X by showing (\forall i = 2, 3, 4, ...)\ Pr\{Y = i\} = Pr\{1 + X = i\}. We only check i = 2, 3, 4, ... because both distributions put all their mass on that range.

Left hand side:

\begin{align*}
&Pr\{X = i | X > 1\} \tag{where $i$ can only be $2, 3, 4, 5, ...$}\\
=& \frac{Pr\{X = i \cap X > 1\}}{Pr\{X > 1\}}\\
=& \frac{Pr\{X = i\}}{Pr\{X > 1\}}\\
=& \frac{(1 - p)^{i - 1}p}{1 - p}\\
=& (1 - p)^{i - 2}p\\
\end{align*}

Right hand side:

\begin{align*}
& Pr\{1 + X = i\} \tag{where $i$ can only be $2, 3, 4, 5, ...$}\\
=& Pr\{X = i - 1\}\\
=& (1 - p)^{i - 2}p\\
\end{align*}
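
The memorylessness identity Pr\{X = t + s | X > s\} = Pr\{X = t\} can also be verified numerically straight from the geometric p.m.f.; the values of p and s and the series cutoff below are arbitrary choices:

```python
# Memorylessness of the geometric: Pr{X = t + s | X > s} = Pr{X = t}.
p, s = 0.3, 4

def pmf(i):
    """Geometric(p) p.m.f.: (1-p)^(i-1) * p for i = 1, 2, 3, ..."""
    return (1 - p) ** (i - 1) * p

# Pr{X > s}, computed as a truncated series (200 terms is far past
# where the geometric tail matters for p = 0.3).
tail_s = sum(pmf(i) for i in range(s + 1, 200))

for t in range(1, 6):
    lhs = pmf(t + s) / tail_s  # Pr{X = t + s | X > s}
    rhs = pmf(t)               # Pr{X = t}
    assert abs(lhs - rhs) < 1e-9
```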

Corollary: For X \sim \text{Geometric}(p), E[X^2 | X > 1] = E[Y^2] = E[(1 + X)^2]

Let X \sim \text{Poisson}(\lambda), then

\begin{align*}
E[X] &= \sum_{i = 1}^\infty i \frac{e^{-\lambda}\lambda^i}{i!}\\
&= e^{-\lambda} \cdot \lambda \cdot \sum_{i = 1}^\infty \frac{\lambda^{i - 1}}{(i - 1)!}\\
&= e^{-\lambda} \cdot \lambda \cdot (1 + \frac{\lambda^1}{1!} + \frac{\lambda^2}{2!} + \frac{\lambda^3}{3!} + ...)\\
&= e^{-\lambda} \cdot \lambda \cdot e^{\lambda}\\
&= \lambda\\
\end{align*}
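
The series above can be summed numerically as a sanity check; λ = 2.5 and the 100-term cutoff are arbitrary choices (the Poisson tail is negligible well before then):

```python
import math

# Check that sum_i i * e^{-lam} * lam^i / i! = lam, truncating the series.
lam = 2.5
mean = sum(
    i * math.exp(-lam) * lam**i / math.factorial(i) for i in range(100)
)
print(mean)  # 2.5 (up to truncation error)
```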

Let X \sim \text{Binomial}(n, p), then E[X] = \sum_{i = 0}^n i {n \choose i} p^i (1 - p)^{n - i}

Calculate using Linear Expectation:

define X_i = \text{value of } i \text{-th coin flip} = \begin{cases} 1 & \text{if head}\\ 0 & \text{if tail}\\ \end{cases}

\begin{align*}
X &= X_1 + X_2 + ... + X_n\\
E[X] &= E[X_1] + E[X_2] + ... + E[X_n]\\
E[X] &= p + p + ... + p\\
E[X] &= np\\
\end{align*}
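
A simulation sketch of the same decomposition, summing n Bernoulli(p) flips per trial; n = 20 and p = 0.25 are arbitrary choices:

```python
import random

random.seed(1)
n, p = 20, 0.25

# X = X_1 + ... + X_n with X_i ~ Bernoulli(p); E[X] should be n * p = 5.
trials = 50_000
total = sum(
    sum(1 for _ in range(n) if random.random() < p)  # one Binomial(n, p) draw
    for _ in range(trials)
)
estimate = total / trials
print(estimate)  # close to n * p = 5
```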

Expectation of Function of Random Variable: E[g(X)] = \sum_i g(i) \cdot P_X(i)

- This essentially maps the random variable to a new value we care about; the original value of X becomes something we no longer care about

We have the following two methods to calculate expectation:

\begin{align*}
&\int f(x) p(x) dx = E[f(X)] \approx \frac{1}{N} \sum_{i = 1}^N f(x_i) \tag{$x_i$ sampled from density $p$}\\
\implies& \int f(x) dx = \int \frac{f(x)}{p(x)}p(x) dx \approx \frac{1}{N} \sum_{i = 1}^N \frac{f(x_i)}{p(x_i)}
\end{align*}

This can be better understood when f(x) = x: in that case \int x dx \approx \frac{1}{N}\sum_{i = 1}^N \frac{x_i}{p(x_i)}.

This leads us to the Monte Carlo Estimator, where w(x_i) is a weighting function (usually w(x_i) = \frac{1}{p(x_i)}):

\int f(x) dx \approx \frac{1}{N} \sum_{i = 1}^N f(x_i) w(x_i)
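
A minimal sketch of the estimator for \int_0^1 x^2 dx = 1/3, assuming samples x_i \sim \text{Uniform}(0, 1) so that p(x) = 1 and w(x_i) = 1; the integrand and sample count are arbitrary choices:

```python
import random

random.seed(2)

def f(x):
    return x * x

# Monte Carlo estimate of the integral of x^2 over [0, 1].
# With uniform samples the weight w(x_i) = 1/p(x_i) = 1.
N = 200_000
estimate = sum(f(random.random()) for _ in range(N)) / N
print(estimate)  # close to 1/3
```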

Expectation of Product of Random Variables: If X \perp Y, then E[XY] = E[X] \cdot E[Y]

Proof:

\begin{align*}
E[XY] &= \sum_x \sum_y xy P_{X, Y}(x, y)\\
&= \sum_x \sum_y xy P_{X}(x)P_{Y}(y) \tag{independence}\\
&= \sum_x x \cdot P_{X}(x) \sum_y y \cdot P_{Y}(y)\\
&= E[X] \cdot E[Y]
\end{align*}
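
A simulation check of the product rule with an independently drawn pair; the fair die and Bernoulli(1/2) below are arbitrary choices:

```python
import random

random.seed(3)

# X ~ Uniform{1..6} and an independent Y ~ Bernoulli(0.5);
# E[XY] should equal E[X] * E[Y] = 3.5 * 0.5 = 1.75.
N = 100_000
xs = [random.randint(1, 6) for _ in range(N)]
ys = [random.randint(0, 1) for _ in range(N)]

e_xy = sum(x * y for x, y in zip(xs, ys)) / N
print(e_xy)  # close to 1.75
```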

Corollary: If X \perp Y, then E[g(X)f(Y)] = E[g(X)] \cdot E[f(Y)]. However, the converse is not true: E[XY] = E[X] \cdot E[Y] does not imply X \perp Y.

Linearity of Expectation: E[X + Y] = E[X] + E[Y]

Proof:

\begin{align*}
E[X + Y] &= \sum_x \sum_y (x + y) P_{X, Y}(x, y)\\
&= \sum_x \sum_y x P_{X, Y}(x, y) + \sum_x \sum_y y P_{X, Y}(x, y)\\
&= \sum_x x \sum_y P_{X, Y}(x, y) + \sum_y y \sum_x P_{X, Y}(x, y)\\
&= \sum_x x P_X(x) + \sum_y y P_Y(y)\\
&= E[X] + E[Y]\\
\end{align*}
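
Crucially, the proof above never uses independence. A sketch with a maximally dependent pair (Y = 7 - X for a fair die roll X, an arbitrary illustrative choice) shows linearity still holds:

```python
import random

random.seed(4)

# Y = 7 - X is completely determined by X, yet E[X + Y] = E[X] + E[Y].
N = 100_000
xs = [random.randint(1, 6) for _ in range(N)]
ys = [7 - x for x in xs]

e_sum = sum(x + y for x, y in zip(xs, ys)) / N  # X + Y is always 7 here
e_x = sum(xs) / N
e_y = sum(ys) / N
print(e_sum, e_x + e_y)  # both 7.0
```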

Say in Arknights you have n different characters and you want to collect them all. Each roll you have uniform \frac{1}{n} to get a specific character. What is the expected number of rolls to get full n characters?

Let X = \text{number of rolls to get all } n \text{ characters}. Let X_i = \text{number of additional rolls to get the } i \text{-th new character}. Then X = X_1 + X_2 + ... + X_n where X_i \sim \text{Geometric}(\frac{n - i + 1}{n})

\begin{align*}
E[X] &= E[X_1] + E[X_2] + ... + E[X_n]\\
&= \frac{n}{n} + \frac{n}{n-1} + \frac{n}{n - 2} + ... + n\\
&= n(\frac{1}{n} + \frac{1}{n-1} + \frac{1}{n - 2} + ... + 1)\\
&= n \cdot H_n\\
&\approx n \cdot \ln(n)\\
\end{align*}
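
The coupon-collector argument above can be checked by simulation; n = 10 and the trial count are arbitrary choices:

```python
import random

random.seed(5)
n = 10

def rolls_to_collect(n):
    """Roll uniformly over n characters until all n have been seen."""
    seen = set()
    rolls = 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        rolls += 1
    return rolls

trials = 20_000
estimate = sum(rolls_to_collect(n) for _ in range(trials)) / trials
h_n = sum(1 / k for k in range(1, n + 1))  # harmonic number H_n
print(estimate, n * h_n)  # estimate close to n * H_10 = 29.29...
```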

Conditional p.m.f.: P_{X|A}(x) = Pr\{X = x | A\} where A is an event.

Conditional Expectation: E[X | A] = \sum_i i \cdot P_{X|A}(i)

Computing Expectation via Conditioning: E[X] = E[X|A] \cdot Pr\{A\} + E[X|\bar{A}] \cdot Pr\{\bar{A}\}

Proof:

\begin{align*}
E[X] &= \sum_x x \cdot Pr\{X = x\}\\
&= \sum_x x \cdot (Pr\{X = x | A\} \cdot Pr\{A\} + Pr\{X = x | \bar{A}\} \cdot Pr\{\bar{A}\})\\
&= Pr\{A\} (\sum_x x \cdot Pr\{X = x | A\}) + Pr\{\bar{A}\} (\sum_x x \cdot Pr\{X = x | \bar{A}\})\\
&= Pr\{A\}E[X | A] + Pr\{\bar{A}\}E[X | \bar{A}]\\
\end{align*}
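
A direct numeric check of the identity on a fair die, conditioning on A = "the roll is even" (an arbitrary illustrative event):

```python
# E[X] = E[X|A] * Pr{A} + E[X|A'] * Pr{A'} for X a fair die roll.
pmf = {i: 1 / 6 for i in range(1, 7)}

p_a = sum(p for i, p in pmf.items() if i % 2 == 0)  # Pr{A} = 1/2

# Conditional expectations from the renormalized conditional p.m.f.s.
e_given_a = sum(i * p / p_a for i, p in pmf.items() if i % 2 == 0)        # 4.0
e_given_not_a = sum(i * p / (1 - p_a) for i, p in pmf.items() if i % 2)   # 3.0

total = e_given_a * p_a + e_given_not_a * (1 - p_a)
print(total)  # 3.5, which is E[X]
```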

We have two treatments for kidney stones; their effectiveness results are below.

Facts:

- Treatment A is the better treatment. Treatment A is better in both cases: even if the doctor doesn't know whether the patient has small or large stones, the patient is still more likely to be healed by Treatment A.
- But in our sample, more patients with large stones went to Treatment A, which brings down the aggregate success rate of Treatment A (large stones are harder to handle)
- More patients with small stones went to Treatment B, which brings up the aggregate success rate of Treatment B (small stones are easier to handle)
- However, if a patient ended up taking Treatment A, then the patient is more likely to have large stones and therefore less success; if a patient ended up taking Treatment B, then the patient is more likely to have small stones and therefore more success.
- Treatment CAUSES the patient to heal. But the statistics alone do not indicate causation.
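
A numeric sketch of the paradox. The counts below are the classic kidney-stone figures usually quoted for this example (an assumption here, from Charig et al., 1986); substitute the actual table's numbers if they differ:

```python
# Simpson's paradox with the classic kidney-stone counts (assumed data).
# Each entry is (successes, patients) per treatment and stone size.
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

rates = {}
for t in "AB":
    small, large = data[t]["small"], data[t]["large"]
    overall = (small[0] + large[0], small[1] + large[1])
    rates[t] = {
        "small": small[0] / small[1],
        "large": large[0] / large[1],
        "overall": overall[0] / overall[1],
    }
    print(t, rates[t])

# A beats B within each stone size (0.93 > 0.87 and 0.73 > 0.69),
# yet B wins overall (0.83 > 0.78) because B mostly saw easy small stones.
```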

There are some general methods for solving these problems:

- Conditioning
- Linear Expectation (Variance, Transform)
- Summing Expectation
- Summing Tail
- Bayes Law
- Z-Transform and Laplace Transform
- Memoryless
- Integrate p.d.f. for c.d.f, Differentiate c.d.f. for p.d.f.
