
INTRODUCTION
1.1 DATA MINING
Data mining is the process of analyzing data from different perspectives to uncover hidden patterns and summarize them into useful information. The data is typically collected and assembled in common repositories, such as data warehouses, where data mining algorithms can analyze it efficiently to support business decision making and other information needs, ultimately cutting costs and increasing revenue. Data mining is also known as data discovery or knowledge discovery.
The major steps involved in a data mining process are:
• Extract, transform and load data into a data warehouse
• Store and manage data in a multidimensional database
• Provide data access to business analysts using application software
• Present analyzed data in easily understandable forms, such as graphs
The first step in data mining is gathering relevant data critical for the business. Company data is either transactional, non-operational, or metadata. Transactional data deals with day-to-day operations such as sales, inventory, and costs. Non-operational data typically consists of forecast data, while metadata describes the logical database design. Patterns and relationships among data elements yield relevant information, which may increase organizational revenue. Organizations with a strong consumer focus use data mining techniques to obtain a clear picture of products sold, prices, competition, and customer demographics.
For instance, the retail giant Wal-Mart transmits all of its relevant information to a data warehouse holding terabytes of data. Suppliers can easily access this data to identify customer buying patterns. Using data mining techniques, they can derive patterns in shopping habits, the most popular shopping days, the most sought-after products, and other information.
The second step in data mining is selecting a suitable algorithm – a mechanism that produces a data mining model. In general, the algorithm identifies trends in a set of data and uses that output to define the model's parameters. The most popular algorithms used for data mining are classification and regression algorithms, which identify relationships among data elements.
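As a minimal illustration only (not part of any pipeline described in this report), the sketch below assumes a toy transactional dataset and uses scikit-learn to fit one classification and one regression algorithm of the kind mentioned above:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Hypothetical transactional records: [units_sold, unit_price]
X = np.array([[120, 9.99], [30, 24.50], [75, 14.00], [10, 49.99]])
high_revenue = np.array([1, 0, 1, 0])        # classification target (illustrative)
revenue = X[:, 0] * X[:, 1]                  # regression target

clf = DecisionTreeClassifier().fit(X, high_revenue)   # learns class boundaries
reg = LinearRegression().fit(X, revenue)              # learns a numeric relationship

print(clf.predict([[60, 12.00]]), reg.predict([[60, 12.00]]))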
1.2 Educational Data Mining
Educational data mining (EDM) describes a research field concerned with the application of data mining, machine learning and statistics to information generated from educational settings (e.g., universities and intelligent tutoring systems). At a high level, the field seeks to develop and improve methods for exploring this data, which often has multiple levels of meaningful hierarchy, in order to discover new insights about how people learn in the context of such settings. In doing so, EDM has contributed to theories of learning investigated by researchers in educational psychology and the learning sciences. The field is closely tied to that of learning analytics, and the two have been compared and contrasted.
Educational data mining refers to techniques, tools, and research designed for automatically extracting meaning from large repositories of data generated by or related to people’s learning activities in educational settings. Quite often, this data is extensive, fine-grained, and precise. For example, several learning management systems (LMSs) track information such as when each student accessed each learning object, how many times they accessed it, and how many minutes the learning object was displayed on the user’s computer screen. As another example, intelligent tutoring systems record data every time a learner submits a solution to a problem; they may collect the time of the submission, whether or not the solution matches the expected solution, the amount of time that has passed since the last submission, the order in which solution components were entered into the interface, etc. The precision of this data is such that even a fairly short session with a computer-based learning environment (e.g., 30 minutes) may produce a large amount of process data for analysis.
In other cases, the data is less fine-grained. For example, a student’s university transcript may contain a temporally ordered list of courses taken by the student, the grade that the student earned in each course, and when the student selected or changed his or her academic major. EDM leverages both types of data to discover meaningful information about different types of learners and how they learn, the structure of domain knowledge, and the effect of instructional strategies embedded within various learning environments. These analyses provide new information that would be difficult to discern by looking at the raw data. For example, analyzing data from an LMS may reveal a relationship between the learning objects that a student accessed during the course and their final course grade. Similarly, analyzing student transcript data may reveal a relationship between a student’s grade in a particular course and their decision to change their academic major.
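To make the LMS example concrete, the following sketch uses an entirely hypothetical export of per-student access counts and final grades; it simply checks whether the two are correlated, which is the kind of relationship such an analysis might reveal:

import pandas as pd

# Hypothetical LMS data: names and values are illustrative only.
lms = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "lecture_video_views": [12, 3, 8, 15, 1],
    "final_grade": [88, 61, 74, 92, 55],
})

# A simple correlation between access counts and final grades.
print(lms["lecture_video_views"].corr(lms["final_grade"]))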

Educational data mining is emerging as a research area with a suite of computational and psychological methods and research approaches for understanding how students learn. New computer-supported interactive learning methods and tools—intelligent tutoring systems, simulations, games—have opened up opportunities to collect and analyze student data, to discover patterns and trends in those data, and to make new discoveries and test hypotheses about how students learn. Data collected from online learning systems can be aggregated over large numbers of students and can contain many variables that data mining algorithms can explore for model building.
1.2.1 Users and stakeholders
There are four main users and stakeholders involved with educational data mining. These include:
Learners – Learners are interested in understanding student needs and in methods that improve the learner’s experience and performance. For example, learners can benefit from the discovered knowledge when EDM tools suggest activities and resources based on their interactions with the online learning tool and on insights from past or similar learners. For younger learners, educational data mining can also inform parents about their child’s learning progress. It is also necessary to group learners effectively in an online environment; the challenge is to learn these groups from complex data and to develop actionable models for interpreting them.

Educators – Educators attempt to understand the learning process and the methods they can use to improve their teaching. Educators can use EDM applications to determine how to organize and structure the curriculum, the best methods for delivering course information, and the tools to use to engage their learners for optimal learning outcomes. In particular, the distillation of data for human judgment gives educators an opportunity to benefit from EDM, because it enables them to quickly identify behavioural patterns that can support their teaching during the course or improve future courses.
Researchers – Researchers focus on the development and evaluation of data mining techniques for effectiveness. A yearly international conference for researchers began in 2008, followed by the establishment of the Journal of Educational Data Mining in 2009. Topics in EDM range from using data mining to improve institutional effectiveness to analysing student performance.
Administrators – Administrators are responsible for allocating the resources for implementation in institutions. As institutions are increasingly held responsible for student success, administering EDM applications is becoming more common in educational settings. Faculty and advisors are becoming more proactive in identifying and addressing at-risk students. However, it can be a challenge to get the information to the decision makers so that the application is administered in a timely and efficient manner.
1.2.2 Goals of EDM:
1. Predicting students’ future learning behavior by creating student models that incorporate such detailed information as students’ knowledge, motivation, metacognition, and attitudes;
2. Discovering or improving domain models that characterize the content to be learned and optimal instructional sequences;
3. Studying the effects of different kinds of pedagogical support that can be provided by learning software; and
4. Advancing scientific knowledge about learning and learners through building computational models that incorporate models of the student, the domain, and the software’s pedagogy.
1.3 Multinomial Naive Bayes Classifier:
One of the most popular applications of machine learning is the analysis of categorical data, and text data in particular. While many tutorials exist for numeric data, comparatively few deal with text. Rather than relying on an off-the-shelf implementation such as Scikit-Learn’s, the classifier here is implemented from scratch.
This section covers how to implement a Multinomial Naive Bayes classifier for the 20 Newsgroups dataset. The 20 Newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and one for testing (or performance evaluation). The split between the train and test sets is based upon messages posted before and after a specific date.
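The classifier itself is implemented by hand; one common way to obtain the dataset (an assumption here, not prescribed by the text) is scikit-learn’s fetch_20newsgroups helper:

from sklearn.datasets import fetch_20newsgroups

# The standard split is date-based: older posts form the training set,
# newer posts the test set.
train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

print(len(train.data), len(test.data))   # about 11,314 training and 7,532 test posts
print(train.target_names)                # the 20 topic labels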
Libraries
First, let us import the libraries needed for writing the implementation:
import numpy as np                 # numerical arrays and vectorized math
import pandas as pd                # tabular data handling
import matplotlib.pyplot as plt    # plotting
import operator                    # standard operators as functions (e.g., for sorting by value)
Class Distribution
First, we calculate the fraction of documents in each class:

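A minimal sketch of this step, assuming labels is an array holding the class index of each training document (for instance, train.target from the loader shown earlier):

import numpy as np

labels = np.array([0, 2, 1, 0, 2, 2, 1, 0])   # illustrative labels only

counts = np.bincount(labels)       # documents per class
priors = counts / counts.sum()     # fraction of documents in each class, P(C_k)
print(priors)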
With a multinomial event model, samples (feature vectors) represent the frequencies with which certain events have been generated by a multinomial (p_1, …, p_n), where p_i is the probability that event i occurs (or K such multinomials in the multiclass case). A feature vector x = (x_1, …, x_n) is then a histogram, with x_i counting the number of times event i was observed in a particular instance.
This is the event model typically used for document classification, with events representing the occurrence of a word in a single document (see the bag-of-words assumption). The likelihood of observing a histogram x is given by

p(x | C_k) = ( (∑_i x_i)! / ∏_i x_i! ) ∏_i p_{ki}^{x_i}

where p_{ki} is the probability that word i occurs in a document of class C_k.
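A minimal sketch of evaluating this likelihood for one class, with made-up values: the multinomial coefficient (∑_i x_i)! / ∏_i x_i! is identical for every class, so it is dropped when comparing classes, leaving only ∑_i x_i · log(p_ki).

import numpy as np

p_k = np.array([0.5, 0.3, 0.2])    # illustrative word probabilities for class C_k
x = np.array([3, 1, 0])            # word-count histogram of one document

log_likelihood = np.sum(x * np.log(p_k))
print(log_likelihood)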
If a given class and feature value never occur together in the training data, then the frequency-based probability estimate will be zero. This is problematic because it will wipe out all information in the other probabilities when they are multiplied. Therefore, it is often desirable to incorporate a small-sample correction, called a pseudocount, in all probability estimates such that no probability is ever set to exactly zero. This way of regularizing naive Bayes is called Laplace smoothing when the pseudocount is one, and Lidstone smoothing in the general case.
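A minimal sketch of Lidstone smoothing for the per-class word probabilities, using illustrative counts; setting alpha to 1 corresponds to Laplace smoothing.

import numpy as np

alpha = 1.0
word_counts = np.array([[10, 0, 5],     # word 1 never seen in class 0
                        [2, 8, 1]], dtype=float)

smoothed = (word_counts + alpha) / (word_counts + alpha).sum(axis=1, keepdims=True)
print(smoothed)   # no estimated probability is exactly zero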
Rennie et al. discuss problems with the multinomial assumption in the context of document classification and possible ways to alleviate those problems, including the use of tf–idf weights instead of raw term frequencies and document length normalization, to produce a naive Bayes classifier that is competitive with support vector machines.
Despite the fact that the far-reaching independence assumptions are often inaccurate, the naive Bayes classifier has several properties that make it surprisingly useful in practice. In particular, the decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one-dimensional distribution. This helps alleviate problems stemming from the curse of dimensionality, such as the need for data sets that scale exponentially with the number of features.
While naive Bayes often fails to produce a good estimate for the correct class probabilities, this may not be a requirement for many applications. For example, the naive Bayes classifier will make the correct MAP decision rule classification so long as the correct class is more probable than any other class. This is true regardless of whether the probability estimate is slightly, or even grossly inaccurate. In this manner, the overall classifier can be robust enough to ignore serious deficiencies in its underlying naive probability model.
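To illustrate the MAP decision rule with entirely made-up numbers, the sketch below picks the class with the largest log prior plus log likelihood; the absolute probability values do not matter, only which class scores highest.

import numpy as np

log_priors = np.log(np.array([0.6, 0.4]))        # P(C_0), P(C_1)
word_probs = np.array([[0.7, 0.1, 0.2],          # p_{0i}
                       [0.2, 0.6, 0.2]])         # p_{1i}
x = np.array([3, 1, 0])                          # word counts of one document

scores = log_priors + (x * np.log(word_probs)).sum(axis=1)
print(np.argmax(scores))                         # index of the MAP class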