Posted by: tonyteaching | January 2, 2010

Understanding CORRELATION (the relationship between two things)

Correlation, the relationship between two variables, can be defined as follows: when the first variable changes, the second variable tends to change as well. (This does not mean that the first variable causes the change in the second!)

A common example from my own life:

When I was a lecturer in Yogya I liked to ride a Honda 'Supra Vit' motorcycle, which cost Rp 13 million to buy, while a friend of mine, also a lecturer, drove a 'Suzuki Baleno' car every day, which cost more than Rp 80 million even second-hand.
So the question is: who has the bigger SALARY, me or my friend?

Most people would answer: my friend! In other words, you conclude that my salary is LOWER than my friend's, reasoning that I can only afford to buy and run a motorcycle while my friend can afford to buy and run an expensive car, so Tony's friend must have a BIGGER SALARY than Tony.

In this case, you (or people in general) have already assumed that there is a positive RELATIONSHIP (correlation) between the vehicle a person drives and that person's monthly income.
(The better the vehicle, the higher the income.)

Can we be sure of that conclusion? Clearly not! If we want to accept that the relationship really exists, we have to carry out a survey of incomes and vehicle types.
It could well be that my friend earns less than I do but got the car from an older sibling or inherited it from his parents. It could be that his running costs are paid by his wife, who has a large income, whereas I, even though I earn slightly more than he does, have a wife who is not working and so have to be really economical. And there are many other possibilities.


Determining the correlation between two variables simply requires drawing two axes: the X axis for variable 1 and the Y axis for variable 2. Then, for each case/person/respondent, plot the coordinate (X, Y) where that case's value of variable 1 meets its value of variable 2. Now we can look at the pattern in which the coordinates are scattered.

The scatter pattern of the coordinates of two variables can take one of four forms:

Possibility 1: The correlation is linear (a straight line) and positive (rising from left to right)

Possibility 2: The correlation is linear (a straight line) and negative (falling from left to right)

Possibility 3: The correlation is non-linear (a curve)

Possibility 4: There is no correlation (no pattern)

The explanation below is copied from Richard Bowles:



Now let’s extend the comparison so that we are comparing several items, not just two. In this case, we won’t have to presume that there must be a correlation – we will be able to see whether there is one or not! Here is a table showing the results of two examinations set to students that I teach. I set them a maths exam and an English exam and record the scores that they get in both:

Student    Maths score    English score
John       72             78
Betty      65             70
Sarah      80             81
Peter      36             31
Fiona      50             55
Charlie    21             29
Tim        79             74
Gerry      64             64
Martine    44             47
Rachel     55             53

We take a piece of graph paper and draw two axes. The horizontal axis will represent the score on the English exam. The vertical axis will represent the score on the Maths exam. For each student, we then mark a small dot at the co-ordinates representing their two scores. Below, I have done this:

You can see that the points follow a fairly strong pattern. People who are good at maths tend to be good at English as well. The marks lie fairly close to an imaginary straight line that we can draw on the graph. In the diagram below, I have drawn in this straight line, and also included another point (in red) which I will explain later:

The fact that the points lie close to the straight line is called a strong correlation. The fact that this line slopes upwards to the right, indicating that the English mark tends to increase as the maths mark increases, is called a positive correlation.
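One common way to put a number on "strong" and "positive" is the Pearson correlation coefficient, which runs from -1 to +1. The text above does not name this measure, so treat the following as a minimal sketch under that assumption; the function name `pearson_r` is my own.

```python
from math import sqrt

# Scores from the table above, in the same student order.
maths = [72, 65, 80, 36, 50, 21, 79, 64, 44, 55]
english = [78, 70, 81, 31, 55, 29, 74, 64, 47, 53]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of the deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = pearson_r(maths, english)
print(round(r, 2))  # 0.97
```

A value this close to +1 matches what the scattergram suggests: the points hug a rising straight line.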

Line of Best Fit (Regression Line)

The straight line that we draw through the points is called either the line of best fit or the regression line. It describes the relationship between the two variables (the quantities compared) mathematically. There is a standard way to draw this line to ensure that it fits as closely to the data points as possible. Later on, we will investigate exactly what that mathematical way is. For now, we only have to remember one thing:

The regression line goes through the point whose co-ordinates are the mean values of the variables

The arithmetic means are found by adding the relevant scores, and dividing by 10. This is because there are ten students in the table. We work out the arithmetic mean of the maths scores …

mean maths score = (72 + 65 + 80 + 36 + 50 + 21 + 79 + 64 + 44 + 55) / 10 = 56.6

… then we work out the arithmetic mean of the English scores …

mean English score = (78 + 70 + 81 + 31 + 55 + 29 + 74 + 64 + 47 + 53) / 10 = 58.2

and we can be sure that the line must go through the point where the maths score is 56.6 and the English score is 58.2. This is the point marked in red on the graph above. You will notice that there are roughly the same number of data points lying above this line as there are below it.
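The two means, and the claim that the line passes through them, can be checked with a few lines of Python. This is only a sketch: it assumes the "standard way" of drawing the line is an ordinary least-squares fit, which the text has not yet confirmed, and the helper names `mean` and `least_squares_line` are mine.

```python
# Scores from the table above, in the same student order.
maths = [72, 65, 80, 36, 50, 21, 79, 64, 44, 55]
english = [78, 70, 81, 31, 55, 29, 74, 64, 47, 53]

def mean(values):
    return sum(values) / len(values)

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line predicting ys from xs."""
    mean_x, mean_y = mean(xs), mean(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

mean_maths = mean(maths)
mean_english = mean(english)
print(round(mean_maths, 1), round(mean_english, 1))  # 56.6 58.2

# Evaluate the fitted line at the mean maths score:
# it should give back the mean English score.
slope, intercept = least_squares_line(maths, english)
at_mean = slope * mean_maths + intercept
print(abs(at_mean - mean_english) < 1e-9)  # True
```

The intercept is computed from the two means, so a least-squares line passes through the mean point by construction.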

We can use the regression line to make predictions. For instance, what English mark would we expect someone to receive if they received a maths mark of 30? If we look at the straight line, we can see that when the maths mark is 30, the English mark is approximately 28. Similarly, we can assume that anyone who got an English mark of 40 would also get a maths mark of about 40. However, there are limits on the predictions that we can make, as you will see later on.

Negative Correlation

In the following table, I have duplicated the maths marks for the ten students and this time added the number of absences from maths lessons for each student:

Student    Maths score    Absences
John       72             4
Betty      65             6
Sarah      80             0
Peter      36             13
Fiona      50             8
Charlie    21             15
Tim        79             2
Gerry      64             3
Martine    44             9
Rachel     55             5

In this case, the scattergram looks like this. I have added the regression line. Again, there is a good correlation between the maths scores and the absences from maths lessons, except that as the number of absences increases, the maths score goes down. This is referred to as negative correlation. Again, we can use the line of best fit to make predictions. What score would a student have received if he had been absent 10 times? According to the graph, it would have been about 41. If a student received a mark of 30, how many times would you expect him to have been absent? From the graph, it seems to be about 13 times.

However, this graph shows well the limitations of making predictions. What score would someone have received if they had been absent for all 30 maths lessons? According to the graph, the score would be less than zero! Similarly, how many times would a student have had to be absent in order to gain a score of 90? Well, the line hits the horizontal axis when the score is just over 80, so in order to get a score of 90, a student would have to be absent a negative number of times. Clearly, these conclusions are stupid, and they lead us to another general principle:

You can only use linear regression to draw conclusions about values within the range of the data points themselves. You might just be able to get away with drawing conclusions about values just outside that range, but the further away from the data range you move, the less reliable the conclusions become!
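To see this principle in numbers, here is a rough sketch that fits a line to the absences table and flags any prediction made outside the observed range. It assumes a least-squares fit of maths score on absences; the line in the original graph may have been drawn differently, so values read off that graph can differ by a few marks. The names `fit_line` and `predict_score` are hypothetical.

```python
# Data from the absences table, in the same student order.
maths = [72, 65, 80, 36, 50, 21, 79, 64, 44, 55]
absences = [4, 6, 0, 13, 8, 15, 2, 3, 9, 5]

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line predicting ys from xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(absences, maths)

def predict_score(n_absences):
    """Return (estimated score, whether n_absences lies inside the observed range)."""
    estimate = slope * n_absences + intercept
    in_range = min(absences) <= n_absences <= max(absences)
    return estimate, in_range

for n in (10, 30):
    estimate, in_range = predict_score(n)
    note = "within the data range" if in_range else "extrapolated, do not trust"
    print(f"{n} absences -> {estimate:.1f} ({note})")
```

For 10 absences the estimate lands in the low 40s and is inside the observed range of 0 to 15 absences. For 30 absences the estimate is negative, which is exactly the kind of impossible result that extrapolation produces.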

No correlation

Finally, one more table, this time showing the English marks compared with the average length of time the students spend travelling to college each morning, recorded in minutes.

Student    English score    Time (minutes)
John       78               12
Betty      70               32
Sarah      81               19
Peter      31               31
Fiona      55               30
Charlie    29               15
Tim        74               22
Gerry      64               10
Martine    47               17
Rachel     53               16

In this case, the scattergram shows no particular pattern. It is clear that we can’t draw a straight line anywhere near the data points, and we say that there is no correlation between the length of time taken to travel to college and the final English mark that a student gets. We cannot predict the English mark of any student based on how long it takes him to get to college. Nor can we predict how long it takes a student to get to college given that student’s English mark.
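As a rough numerical check on "no particular pattern", one can compute the Pearson correlation coefficient for this table. The text does not use this measure, so this is an assumption of mine, as is the function name `pearson_r`.

```python
from math import sqrt

# Data from the travel-time table, in the same student order.
english = [78, 70, 81, 31, 55, 29, 74, 64, 47, 53]
travel_minutes = [12, 32, 19, 31, 30, 15, 22, 10, 17, 16]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = pearson_r(english, travel_minutes)
print(round(r, 2))  # -0.15
```

A coefficient this close to zero agrees with the scattergram: with only ten students, it gives no usable basis for predicting one quantity from the other.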

Non-linear correlations

A bus company wanted to discover if there was any relationship between the number of buses it ran and the number of complaints it received. It carried out a survey testing the average number of buses per hour for different days, and the number of complaints that it received on those days. Here are the results:

As you can see, there is a negative correlation between the number of buses per hour and the number of complaints, but in this case, a curved line fits the data better than a straight line. We are about to investigate the rule that lets you fit a straight line to the data points – it is enough to say at this point that similar rules exist which let you fit various curved lines to the data points as well.
