CS2710, ISSP 2160 Fundamentals of Artificial Intelligence, Fall 2014
Assignment 5: Uncertainty

By submitting a solution to this assignment, you attest that (1) you did all of the work, (2) you did it alone, and (3) you did it without using resources outside of those provided by the class (CS2710/ISSP 2160 Foundations of Artificial Intelligence, Fall 2014), unless you explicitly state otherwise (in which case exactly what you used must be clearly indicated).

------------------------------------------------------

1. Suppose that A is independent of B. It is not true that, for all C, A is conditionally independent of B given the value of C. Give an example that illustrates this.

2. Assume that 2% of the population in a country carry a particular virus. A test kit is able to detect the presence of the virus from a patient's blood sample. The test kit has the following accuracy:

   P(the kit shows positive | the patient is a carrier) = 0.998
   P(the kit shows negative | the patient is not a carrier) = 0.996

What is the probability of a false positive?

From http://www.tc3.edu/instruct/sbrown/stat/falsepos.htm: "Out of 1,098 tests that report positive results, 99 (9%) are correct and 999 (91%) are false positives. Therefore the probability that you actually have disease D, when you're given a positive test result, is only 9%. Symbolically you can write this as P(have D | test positive) = 9%."

3. Assume the following conditional probabilities are available.

   P(WetGrass | Sprinkler, Rain)   = 0.95
   P(WetGrass | Sprinkler, ~Rain)  = 0.9
   P(WetGrass | ~Sprinkler, Rain)  = 0.8
   P(WetGrass | ~Sprinkler, ~Rain) = 0.1
   P(Sprinkler | RainySeason)      = 0.01
   P(Sprinkler | ~RainySeason)     = 0.9
   P(Rain | RainySeason)           = 0.9
   P(Rain | ~RainySeason)          = 0.2
   P(RainySeason)                  = 0.7

Construct a Bayesian network (including the conditional probability tables and the graph structure), and determine the probability P(WetGrass, RainySeason, ~Rain, ~Sprinkler).

4. Below is a data set from the UC Irvine Machine Learning Repository. It concerns whether or not (T, F) a balloon is inflated. The columns are color, size, act, age, and inflated.

   YELLOW,SMALL,STRETCH,ADULT,T
   YELLOW,SMALL,STRETCH,ADULT,T
   YELLOW,SMALL,STRETCH,CHILD,F
   YELLOW,SMALL,DIP,ADULT,F
   YELLOW,SMALL,DIP,CHILD,F
   YELLOW,LARGE,STRETCH,ADULT,T
   YELLOW,LARGE,STRETCH,ADULT,T
   YELLOW,LARGE,STRETCH,CHILD,F
   YELLOW,LARGE,DIP,ADULT,F
   YELLOW,LARGE,DIP,CHILD,F
   PURPLE,SMALL,STRETCH,ADULT,T
   PURPLE,SMALL,STRETCH,ADULT,T
   PURPLE,SMALL,STRETCH,CHILD,F
   PURPLE,SMALL,DIP,ADULT,F
   PURPLE,SMALL,DIP,CHILD,F
   PURPLE,LARGE,STRETCH,ADULT,T
   PURPLE,LARGE,STRETCH,ADULT,T
   PURPLE,LARGE,STRETCH,CHILD,F
   PURPLE,LARGE,DIP,ADULT,F
   PURPLE,LARGE,DIP,CHILD,F

Give the following probabilities:

   P(F | Yellow, Small) =
   P(F, Yellow, Small) =
   P(T, Adult | Purple) =
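If you want to sanity-check your hand counts for question 4, a few lines of Python suffice. The sketch below is mine, not part of the assignment; it rebuilds the 20 rows (each color/size block repeats the same five act/age/label rows), computes joint probabilities as counts over the total, and obtains conditionals as ratios of joints.

    # A minimal sketch for checking the question 4 counts.
    COLS = ('color', 'size', 'act', 'age', 'inflated')

    # Each (color, size) block of the table above repeats these five rows.
    BLOCK = [('STRETCH', 'ADULT', 'T'), ('STRETCH', 'ADULT', 'T'),
             ('STRETCH', 'CHILD', 'F'), ('DIP', 'ADULT', 'F'),
             ('DIP', 'CHILD', 'F')]

    data = [dict(zip(COLS, (color, size, act, age, label)))
            for color in ('YELLOW', 'PURPLE')
            for size in ('SMALL', 'LARGE')
            for act, age, label in BLOCK]

    def p(**cond):
        """Empirical probability that every column=value condition holds."""
        hits = [row for row in data if all(row[k] == v for k, v in cond.items())]
        return len(hits) / len(data)

    # A joint probability is a direct count; a conditional is a ratio of joints.
    print(p(inflated='F', color='YELLOW', size='SMALL'))        # P(F, Yellow, Small)
    print(p(inflated='F', color='YELLOW', size='SMALL') /
          p(color='YELLOW', size='SMALL'))                      # P(F | Yellow, Small)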
5. We want to classify athletes as either not rich or rich. Each athlete plays either basketball or tennis, and is either male or female. Thus, we have:

   Economic status (E): not-rich, rich
   Sport (S): basketball, tennis
   Gender (G): male, female

Total number of athletes: 640

   320 athletes are rich (E = rich)
       160 are basketball players (E = rich, S = basketball)
           40 are female  (E = rich, S = basketball, G = female)
           120 are male   (E = rich, S = basketball, G = male)
       160 are tennis players (E = rich, S = tennis)
           120 are female (E = rich, S = tennis, G = female)
           40 are male    (E = rich, S = tennis, G = male)
   320 athletes are not rich (E = not-rich)
       160 are basketball players (E = not-rich, S = basketball)
           120 are female (E = not-rich, S = basketball, G = female)
           40 are male    (E = not-rich, S = basketball, G = male)
       160 are tennis players (E = not-rich, S = tennis)
           40 are female  (E = not-rich, S = tennis, G = female)
           120 are male   (E = not-rich, S = tennis, G = male)

Are G and S conditionally independent given E? Please support your answer.

6. See the appendix below for information about the Naive Bayes probabilistic classifier.

                          positive   negative
   P(Class)                 0.5        0.5
   *Size*
   P(small | Class)         0.4        0.4
   P(medium | Class)        0.1        0.2
   P(large | Class)         0.5        0.4
   *Color*
   P(red | Class)           0.9        0.3
   P(blue | Class)          0.05       0.3
   P(green | Class)         0.05       0.4
   *Shape*
   P(square | Class)        0.05       0.4
   P(triangle | Class)      0.05       0.3
   P(circle | Class)        0.9        0.3

(6.A) Apply EQ1 to the test instance. Please show your work.

(6.B) Calculate EQ2 for the same test instance, for both the positive and negative classes. (How can you derive the denominator, given what you have?) Please show your work.
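The arithmetic in (6.A) and (6.B) is mechanical, and the normalization is the usual place to slip. Here is a minimal sketch of EQ1 and EQ2 over the table above; the cpt layout and function names are my own, and the instance shown is only a placeholder for the test instance in (6.A). Note how the denominator of EQ2 falls out of what you already have: summing the EQ1 scores over all classes gives exactly P(a1, ..., an).

    # cpt[value][class] = P(value | class), copied from the question 6 table.
    prior = {'positive': 0.5, 'negative': 0.5}
    cpt = {
        'small':    {'positive': 0.4,  'negative': 0.4},
        'medium':   {'positive': 0.1,  'negative': 0.2},
        'large':    {'positive': 0.5,  'negative': 0.4},
        'red':      {'positive': 0.9,  'negative': 0.3},
        'blue':     {'positive': 0.05, 'negative': 0.3},
        'green':    {'positive': 0.05, 'negative': 0.4},
        'square':   {'positive': 0.05, 'negative': 0.4},
        'triangle': {'positive': 0.05, 'negative': 0.3},
        'circle':   {'positive': 0.9,  'negative': 0.3},
    }

    def eq1_score(instance, cls):
        """EQ1: the unnormalized score P(a1|class) ... P(an|class) P(class)."""
        score = prior[cls]
        for value in instance:
            score *= cpt[value][cls]
        return score

    def eq2_posterior(instance, cls):
        """EQ2: normalize EQ1; the denominator P(a1,...,an) is the sum of
        the EQ1 scores over all classes (law of total probability)."""
        z = sum(eq1_score(instance, c) for c in prior)
        return eq1_score(instance, cls) / z

    instance = ('medium', 'red', 'circle')   # placeholder; use the instance from (6.A)
    for cls in prior:
        print(cls, eq1_score(instance, cls), eq2_posterior(instance, cls))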
(6.C) One approach to resolving ambiguous words in English is to use Bayesian reasoning based on surrounding words. Consider the following three meanings of the word "class":

   1. "prototype for an object in Object-Oriented Programming (OOP)";
   2. "education imparted in a series of lessons or class meetings";
   3. "people having the same social or economic status".

Assume we treat the presence (or absence) of the following words anywhere in a sentence as evidence:

   "people"    ("People often forget to define a destructor for their class";
                "People are often late to class";
                "The struggle of lower class people is the driving force of progress")
   "program"   ("This program does not use the window class";
                "This class is a required part of the natural science program";
                "The government's tax program does not address the needs of the lower class")
   "student"   ("This window class was written by a clever student";
                "The student was late to class";
                "The student was concerned with the problems of the working class")
   "education" ("Learning how to write an abstract class is a vital part of your education";
                "Not attending the class will hamper your education";
                "Lowering the cost of education is an important issue for the middle class")

Assume that the following prior and conditional probabilities are measured, where m is a possible meaning for the ambiguous word. E.g.:

   P('student' appears in a sentence | 'class' has the "lessons" meaning in that sentence) = 0.2
   P('student' does not appear in a sentence | 'class' has the "lessons" meaning in that sentence) = 0.8

   m                    OOP      lessons   economic status
   -------------------------------------------------------
   P(m)                 0.1      0.6       0.3
   P('people' | m)      0.001    0.1       0.1
   P('program' | m)     0.1      0.01      0.001
   P('student' | m)     0.01     0.2       0.01
   P('education' | m)   0.005    0.05      0.05

Apply the Naive Bayes classifier to determine which is the most probable meaning of "class" in the sentence "Did the student complete the homework program for the class?" Please show your work.

7. Consider the following Bayesian network (edges: A72 -> A2, A2 -> A6, A7 -> A6, A7 -> A4, A2 -> A5, A6 -> A1, A5 -> A1):

   A72 --> A2 --> A6 <-- A7 --> A4
            \       \
             v       v
             A5 ---> A1

Are A72 and A5 conditionally independent given A2?
Are A72 and A5 d-separated given A2?
Are A1 and A7 d-separated given A6?
Are A1 and A7 d-separated given A6, A2?
Are A2 and A4 d-separated given A1, A7?

8. R&N 14.1.

9. Prove that a variable is independent of all other variables in the network, given its Markov blanket. In your answer, refer to Figure 14.4. You can assume the TA will have Figure 14.4 in front of him when reading your answer.

=============================================

10. Partially trace the decision tree induction algorithm given in lecture on the following data. Specifically, show which attribute is chosen as the root (feel free to use entropy.py), and then show the first recursive calls (i.e., all the calls to DTL the first time the for-loop is executed).

   Day   out     temp   hum    wind     playtennis
   d1    sunny   hot    high   weak     no
   d2    sunny   hot    high   strong   no
   d3    over    hot    high   weak     yes
   d4    rain    mild   high   weak     yes
   d5    rain    cool   norm   weak     yes
   d6    rain    cool   norm   strong   no
   d7    over    cool   norm   strong   yes
   d8    sunny   mild   high   weak     no
   d9    sunny   cool   norm   weak     yes
   d10   rain    mild   norm   weak     yes
   d11   sunny   mild   norm   strong   yes
   d12   over    mild   high   strong   yes
   d13   over    hot    norm   weak     yes
   d14   rain    mild   high   strong   no
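For question 10, entropy.py from the lecture is the intended tool; if you want an independent check, the stand-in sketch below (my own code, not the course script) computes the entropy of the playtennis label and the information gain of each attribute, which is what the root choice is based on.

    from collections import Counter
    from math import log2

    # Rows of the question 10 table: (out, temp, hum, wind, playtennis).
    DATA = [
        ('sunny', 'hot',  'high', 'weak',   'no'),
        ('sunny', 'hot',  'high', 'strong', 'no'),
        ('over',  'hot',  'high', 'weak',   'yes'),
        ('rain',  'mild', 'high', 'weak',   'yes'),
        ('rain',  'cool', 'norm', 'weak',   'yes'),
        ('rain',  'cool', 'norm', 'strong', 'no'),
        ('over',  'cool', 'norm', 'strong', 'yes'),
        ('sunny', 'mild', 'high', 'weak',   'no'),
        ('sunny', 'cool', 'norm', 'weak',   'yes'),
        ('rain',  'mild', 'norm', 'weak',   'yes'),
        ('sunny', 'mild', 'norm', 'strong', 'yes'),
        ('over',  'mild', 'high', 'strong', 'yes'),
        ('over',  'hot',  'norm', 'weak',   'yes'),
        ('rain',  'mild', 'high', 'strong', 'no'),
    ]
    ATTRS = ('out', 'temp', 'hum', 'wind')

    def entropy(rows):
        """Entropy of the playtennis label (the last column) over these rows."""
        counts = Counter(row[-1] for row in rows)
        total = len(rows)
        return -sum(c / total * log2(c / total) for c in counts.values())

    def gain(rows, i):
        """Information gain of splitting these rows on attribute column i."""
        remainder = 0.0
        for value in set(row[i] for row in rows):
            subset = [row for row in rows if row[i] == value]
            remainder += len(subset) / len(rows) * entropy(subset)
        return entropy(rows) - remainder

    for i, name in enumerate(ATTRS):
        print(name, round(gain(DATA, i), 4))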
=============================================

Appendix on Naive Bayes Probabilistic Classification

Assign the class that is most probable, given a combination of attribute values. The attribute values are all given - they are evidence variables.

   answer = argmax P(class | a1, a2, ..., an)
             class

You might not be familiar with argmax. Here is a use that shows how it works:

   maxAbsoluteValue(S) = argmax |s|
                         s in S

"the s in S such that the expression |s| is maximal". E.g.: maxAbsoluteValue([4, -8, -1, 3]) ==> -8

Apply Bayes' rule, plugging into the above:

   answer = argmax  P(a1, a2, ..., an | class) P(class)
             class   -------------------------------------
                              P(a1, a2, ..., an)

We can ignore the denominator, since it is a constant value that is the same for all classes. It won't determine which class we choose. So, our classifier is:

   answer = argmax P(a1, a2, ..., an | class) P(class)
             class

In the Naive Bayes model, the attributes are all conditionally independent of each other, given the value of the class variable.

Let's derive a Naive Bayes model. Suppose we have three attributes, a1, a2, a3.

   answer = argmax P(a1, a2, a3 | class) P(class)        [from above]
             class

*apply the definition of conditional probability*

   answer = argmax  P(a1, a2, a3, class) P(class)
             class   -----------------------------
                             P(class)

*P(class) cancels*

   answer = argmax P(a1, a2, a3, class)
             class

*apply the chain rule*

   answer = argmax P(a1 | a2, a3, class) P(a2 | a3, class) P(a3 | class) P(class)
             class

*apply the conditional independence assumptions*

   (EQ1)   answer = argmax P(a1 | class) P(a2 | class) P(a3 | class) P(class)
                     class

Recall: we dropped the denominator above. We need it if we do want the actual probability:

   (EQ2)   P(class | a1, a2, a3) = P(a1 | class) P(a2 | class) P(a3 | class) P(class)
                                   ---------------------------------------------------
                                                   P(a1, a2, a3)

*For the final, be sure you can derive EQ1 and EQ2, and are able to explain the derivation*

*Example:* Decide whether to play tennis (Yes, No).

Training data:

   outlook   temp   humidity   wind   play
   Sun       H      High       W      No
   Sun       H      High       S      No
   Over      H      High       W      Yes
   Rain      Mild   High       W      Yes
   Rain      Cool   Normal     W      Yes
   Rain      Cool   Normal     S      Yes
   Over      Cool   Normal     S      No
   Sun       Mild   High       W      Yes
   Sun       Cool   Normal     W      No
   Rain      Mild   Normal     W      Yes
   Sun       Mild   Normal     S      Yes
   Over      Mild   High       S      Yes
   Over      H      Normal     W      Yes
   Rain      Mild   High       S      No

To classify the test instance (outlook = sunny, temp = cool, humidity = high, wind = strong), compare:

   P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes)
   P(no)  P(sunny | no)  P(cool | no)  P(high | no)  P(strong | no)

   P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes)
      = 9/14 * 2/9 * 2/9 * 4/9 * 3/9 = .0047

   P(no) P(sunny | no) P(cool | no) P(high | no) P(strong | no)
      = 5/14 * 3/5 * 2/5 * 3/5 * 3/5 = .0309

Answer: No

The numbers are called "parameter estimates", specifically "maximum likelihood estimates". We estimated the parameters based on counts in the training data.
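To make the example concrete end to end, here is a short sketch (the names and layout are mine) that derives the maximum likelihood estimates by counting and evaluates EQ1 for the test instance above. Running it prints roughly 0.0047 for Yes and 0.0309 for No, matching the hand computation.

    # (outlook, temp, humidity, wind) -> play, from the appendix training table.
    TRAIN = [
        (('Sun',  'H',    'High',   'W'), 'No'),
        (('Sun',  'H',    'High',   'S'), 'No'),
        (('Over', 'H',    'High',   'W'), 'Yes'),
        (('Rain', 'Mild', 'High',   'W'), 'Yes'),
        (('Rain', 'Cool', 'Normal', 'W'), 'Yes'),
        (('Rain', 'Cool', 'Normal', 'S'), 'Yes'),
        (('Over', 'Cool', 'Normal', 'S'), 'No'),
        (('Sun',  'Mild', 'High',   'W'), 'Yes'),
        (('Sun',  'Cool', 'Normal', 'W'), 'No'),
        (('Rain', 'Mild', 'Normal', 'W'), 'Yes'),
        (('Sun',  'Mild', 'Normal', 'S'), 'Yes'),
        (('Over', 'Mild', 'High',   'S'), 'Yes'),
        (('Over', 'H',    'Normal', 'W'), 'Yes'),
        (('Rain', 'Mild', 'High',   'S'), 'No'),
    ]

    def eq1_score(instance, cls):
        """EQ1 with maximum likelihood estimates taken from TRAIN:
        P(class) * product over attributes of P(a_i | class)."""
        in_class = [attrs for attrs, label in TRAIN if label == cls]
        score = len(in_class) / len(TRAIN)            # P(class) = 9/14 or 5/14
        for i, value in enumerate(instance):
            matches = sum(1 for attrs in in_class if attrs[i] == value)
            score *= matches / len(in_class)          # P(a_i | class)
        return score

    test = ('Sun', 'Cool', 'High', 'S')
    for cls in ('Yes', 'No'):
        print(cls, round(eq1_score(test, cls), 4))    # Yes -> 0.0047, No -> 0.0309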