In this algorithm, we do not have any target or outcome variable to predict or estimate. It is used for clustering a population into different groups, which is widely applied for segmenting customers into groups for specific interventions. Examples of unsupervised learning: the Apriori algorithm and k-means. Unsupervised learning is very important, because most of the time the data you get in the real world doesn't come with little flags attached that tell you the correct answer.
When we look at this type of data, it looks like there are clumps, or clusters, in it. And if we could identify those clusters, we could maybe say something about a new, unknown data point and what its neighbours might be like.
This is called unsupervised learning.
The most basic algorithm for clustering, and by far the most used, is called k-means.
In k-means, we randomly draw cluster centers; say our initial guess is as shown in the picture above.
The red points are the data points, and green ones are the assumed centres.
These are obviously not the correct cluster centers; we're not done yet.
k-means operates in two steps.
1. Assign
2. Optimize
Assignment: We divide the points between the 2 centers depending on distance. For example, we assign class 1 to those points that are closer to center one than to center two.
Optimization: We move each cluster center so as to minimize the total quadratic (squared) distance to the points assigned to it.
We repeat these two steps iteratively until the assignments stop changing and the clusters stabilize.
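The two steps above can be written as a short NumPy sketch (a minimal illustration of the idea, not the library implementation):

```python
import numpy as np

def kmeans(points, k, n_iter=10, seed=0):
    """Minimal k-means sketch: alternate the assign and optimize steps."""
    rng = np.random.default_rng(seed)
    # Initial guess: pick k random data points as the starting centers.
    centers = points[rng.choice(len(points), k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign: each point goes to its nearest center.
        dists = np.linalg.norm(points[:, None] - centers[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Optimize: move each center to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels
```

On well-separated clumps this converges in a couple of iterations.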
K-means clustering:
class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')
Example code:
from sklearn.cluster import KMeans

clf = KMeans(n_clusters=2)    # ask for two clusters
clf.fit(features)             # find the cluster centers
pred = clf.predict(features)  # assign each point to a cluster
Important parameters of k-means:
1. n_clusters: The default value is 8. The number of clusters is something we need to set ourselves, based on what makes sense for the data.
2. max_iter: Its default value is 300. max_iter says how many iterations of the algorithm to go through while finding the clusters, where each iteration assigns every point to a centroid and then moves the centroids.
3. n_init: The number of different initializations. k-means clustering has the challenge that, depending on the initial conditions, you can sometimes end up with different clusterings. So you want to repeat the algorithm several times: any one clustering might be wrong, but the ensemble of all the clusterings will generally give you something that makes sense. This parameter controls how many times the algorithm is initialized and comes up with clusters. By default it runs ten times.
Limitations of k-means clustering: local minimum for clustering
Given a fixed data set and a fixed number of cluster centers, when we run k-means we don't always arrive at the same result. K-means is what's called a hill-climbing algorithm, and as a result it's very dependent on where we put our initial cluster centers.
Example:
Here the same points are divided into different clusters even though we took 3 centers in both cases, just placed at different starting positions. So it is important to avoid poor local minima when forming clusters.
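A quick way to see this in scikit-learn, using a hypothetical toy data set of three blobs: run k-means with a single random initialization for several seeds and compare the resulting inertia (the total squared distance of points to their nearest center) against a run with n_init=10, which keeps the best of ten initializations:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: three well-separated blobs of 20 points each.
X = np.vstack([np.random.RandomState(0).normal(loc, 0.2, size=(20, 2))
               for loc in ([0, 0], [3, 0], [1.5, 2.5])])

# A single random initialization (n_init=1) can get stuck in a poor
# local minimum; inertia_ measures how good the resulting clustering is.
inertias = [KMeans(n_clusters=3, n_init=1, init="random",
                   random_state=s).fit(X).inertia_
            for s in range(10)]

# n_init=10 runs ten initializations and keeps the one with lowest inertia.
best = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X).inertia_
print(min(inertias), max(inertias), best)
```

If any of the single runs lands in a bad local minimum, its inertia will stand out; the n_init=10 run should match the best of them.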
References:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
In classification we are allowed to have arbitrary input values, but the outputs are binary class labels. This kind of output is called discrete, but in many learning problems our output could be continuous as well.
So for example, if the input is the height of a person and the output is the weight, then what you find is probably a function saying that the taller a person is, the more that person weighs. In this case the output is not a binary concept like light or heavy; it's a continuous concept, and the output itself is also continuous. This is what we call continuous supervised learning.
The linear regression equation is:
y=mx+b
where y is the output of our predictions, x is the input, m is slope and b is the intercept. We get the output in the form of a line.
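As a minimal sketch of the equation, with a hypothetical slope and intercept:

```python
# Hypothetical regression line: slope m = 2, intercept b = 1.
m, b = 2.0, 1.0

def predict(x):
    """Return the predicted output y for an input x on the line y = mx + b."""
    return m * x + b

print(predict(3.0))  # 2.0 * 3.0 + 1.0 = 7.0
```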
Example code:
from sklearn import linear_model

reg = linear_model.LinearRegression()
reg.fit(feature_train, target_train)
print(reg.coef_)
print(reg.intercept_)
print(reg.score(feature_test, target_test))
To find the coefficients and intercept of the line we can use reg.coef_ and reg.intercept_, as in the code above.
To measure the performance of our regression we can call the score function on it. One performance metric we use is r-squared: the higher the r-squared, the better, and its maximum value is one.
Error is a technical term: it is the difference between the actual output and the output predicted by our regression line.
The best chance of getting a good fit to the data is by minimizing the sum of the squared errors over all the data points. Squaring gives us the same advantage as taking the absolute value of the error: even if an error is negative, it becomes positive when squared, and of course if it's positive to begin with, it stays positive.
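A tiny example of why squaring helps: negative and positive errors both contribute positively to the sum, so they cannot cancel each other out.

```python
# Hypothetical residuals (actual output minus predicted output).
errors = [2.0, -3.0, 1.5]
squared_errors = [e ** 2 for e in errors]
sse = sum(squared_errors)  # 4.0 + 9.0 + 2.25 = 15.25
print(sse)
```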
But there’s a problem with SSE. Like in the picture below:
the distribution on the right has a larger sum of squared errors even though it’s probably not doing a much worse job of fitting the data than the distribution on the left. And this is one of the shortcomings of the sum of squared error in general as an evaluation metric.
As we add more data the sum of the squared error will almost certainly go up, but it doesn’t necessarily mean that our fit is doing a worse job.
However, if we are comparing two sets of data that have different number of points in them then this can be a big problem, because if we are using the sum of square errors to figure out which one is being fit better then the sum of squared errors can be jerked around by the number of data points that you’re using, even though the fit might be perfectly fine.
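This effect is easy to demonstrate with a quick experiment on hypothetical synthetic data: fit the same underlying line y = 2x + 1, with the same noise level, on 100 points and then on 10,000 points. The SSE grows roughly a hundredfold, while r-squared stays about the same.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

def fit_metrics(n):
    # Same underlying line y = 2x + 1 and identical noise level,
    # just a different number of points.
    x = rng.uniform(0, 10, size=(n, 1))
    y = 2 * x.ravel() + 1 + rng.normal(0, 1, size=n)
    reg = LinearRegression().fit(x, y)
    sse = ((reg.predict(x) - y) ** 2).sum()
    return sse, reg.score(x, y)

sse_small, r2_small = fit_metrics(100)
sse_big, r2_big = fit_metrics(10000)
# SSE scales with the number of points; r-squared does not.
print(sse_small, sse_big)
print(r2_small, r2_big)
```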
So, we use another metric called R-squared metric in regression.
What r-squared is, is a number that effectively answers the question: how much of the change in my output is explained by the change in my input? The values that r-squared can take on will be between 0 and 1. If the number is very small, that generally means your regression line isn't doing a good job of capturing the trend in the data. On the other hand, if r-squared is large, close to 1, your regression line is doing a good job of describing the relationship between your input (x) variable and your output (y) variable.
The whole point of performing a regression is to come up with a mathematical formula that describes this relationship.
The good thing about r squared is that it’s independent of the number of training points. So, it’ll always be between 0 and 1. So, this is a little bit more reliable than a sum of squared errors especially, if the number of points in the data set could potentially be changing.
In the code above, reg.score() gives us the r-squared value of our regression.
Classification vs regression:
1. Output type: Regression- the output variable takes continuous values.
Classification: the output variable takes class labels.
2. What we try to find: In the case of classification this is usually a decision boundary; depending on where a point falls relative to that boundary, you assign it a class label. With regression, what we're trying to find is a best-fit line.
3. Evaluation: In supervised classification we usually use accuracy, which is whether the class labels were correct or not on the test set. For regression we have different evaluation metrics, one of which is the sum of squared errors; another is r-squared.
References: http://scikit-learn.org/stable/modules/linear_model.html
PDF (Portable Document Format) is similar to a Word document: it saves an electronic version of a document suitable for printing. So it is not really a good format for extracting information. The extraction task becomes easier if the information you want to extract has a fixed format, as in the image below.
As you can see, I had to extract all the above questions from a PDF file into an Excel or CSV file in which each row contains the question number, question, options and answer. To extract the textual information, we need to convert the complete text of the PDF file into a single text file or an HTML file, depending on the requirement and usage. I did this using Python: with the TextConverter or HTMLConverter classes we can easily convert the PDF content into text or HTML format. Below is the code for the conversion.
from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # buffer that collects the extracted text
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    fp = open(path, 'rb')  # open the PDF file in binary mode
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text
In the above code, if you change TextConverter to HTMLConverter, we get an HTML file instead. I extracted the information from the text file: as we can see, if we grab the text between "(Q" and "Answer >>", we get all the information related to the question and its options.
Below is a little generator function that finds every position of a required substring within a string.
def findall(string, sub_string):
    print('Entered the findall function!')
    s = 0
    while True:
        s = string.find(sub_string, s)
        if s == -1:
            return
        yield s
        s += len(sub_string)
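Here is how the generator can be used to slice out each question block. The sample text is hypothetical, and the generator is repeated (without the debug print) so the snippet runs on its own:

```python
def findall(string, sub_string):
    """Yield every index at which sub_string occurs in string."""
    s = 0
    while True:
        s = string.find(sub_string, s)
        if s == -1:
            return
        yield s
        s += len(sub_string)

# Hypothetical extracted text with two question blocks.
text = "(Q1) What is ML? Answer >> B (Q2) What is AI? Answer >> C"
starts = list(findall(text, "(Q"))
ends = list(findall(text, "Answer >>"))
questions = [text[s:e].strip() for s, e in zip(starts, ends)]
print(questions)
```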
When writing the CSV file, always write it in UTF-8 format, or else the spaces will show up as special characters that make the CSV file unreadable. If you are using a Linux environment, the above-mentioned problem can be resolved by changing the format of the CSV file to UTF-8 when opening it. We generally use csv.writer for writing CSV files; instead of that, the xlwt command can be used, which writes the output in UTF-8 format.
Netflix and YouTube make recommendations using machine learning. When we use credit cards to make a purchase, the fraud protection on them is machine learning as well. The tools we use every day are built on top of machine learning.
Machine learning gives us techniques to look at data, to understand it, and to apply algorithms to it. Broadly, there are 3 types of machine learning algorithms:
Supervised Learning: This algorithm consists of a target/outcome variable which is to be predicted from a given set of independent variables. Using this set of variables, we generate a function that maps inputs to desired outputs. The training process continues until the model achieves a desired level of accuracy on the training data. Examples of Supervised Learning: Regression, Decision Tree, Random Forest, KNN, Logistic Regression etc.
Unsupervised Learning: In this algorithm, we do not have any target or outcome variable to predict or estimate. It is used for clustering a population into different groups, which is widely applied for segmenting customers into groups for specific interventions. Examples of Unsupervised Learning: Apriori algorithm, K-means.
Reinforcement Learning: Using this algorithm, the machine is trained to make specific decisions. It works this way: the machine is exposed to an environment where it trains itself continually using trial and error. It learns from past experience and tries to capture the best possible knowledge to make accurate business decisions. Example of Reinforcement Learning: Markov Decision Process.
The Naïve Bayes classifier is amongst the most popular learning methods, built on the well-known Bayes' Theorem of probability. It is used to build machine learning models, particularly for disease prediction and document classification. It is a simple classification of words based on Bayes' theorem for subjective analysis of content.
When to use the Machine Learning algorithm – Naïve Bayes Classifier?
Applications:
Support Vector Machine is a supervised machine learning algorithm for classification or regression problems where the dataset teaches SVM about the classes so that SVM can classify any new data. It works by classifying the data into different classes by finding a line (hyperplane) which separates the training data set into classes. As there are many such linear hyperplanes, the SVM algorithm tries to maximize the distance between the various classes that are involved, and this is referred to as margin maximization. If the line that maximizes the distance between the classes is identified, the probability of generalizing well to unseen data is increased.
SVM’s are classified into two categories:
Advantages of Using SVM
Applications:
SVM is commonly used for stock market forecasting by various financial institutions. For instance, it can be used to compare the relative performance of the stocks when compared to performance of other stocks in the same sector. The relative comparison of stocks helps manage investment making decisions based on the classifications made by the SVM learning algorithm.
It is a type of supervised learning algorithm that is mostly used for classification problems. Surprisingly, it works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets. This is done based on most significant attributes/ independent variables to make as distinct groups as possible.
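A minimal sketch in scikit-learn, using hypothetical [age, income] features and a buy/no-buy label:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: [age, income in thousands] -> bought the product (1) or not (0).
X = [[25, 30], [30, 40], [45, 80], [50, 90], [23, 20], [48, 85]]
y = [0, 0, 1, 1, 0, 1]

# max_depth limits how many splits the tree may make, keeping the groups simple.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)
print(clf.predict([[40, 75]]))
```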
When to use Decision Tree Machine Learning Algorithm
Applications of Decision Tree Machine Learning Algorithm
K-means is a popularly used unsupervised machine learning algorithm for cluster analysis. K-Means is a non-deterministic and iterative method. The algorithm operates on a given data set through pre-defined number of clusters, k. The output of K Means algorithm is k clusters with input data partitioned among the clusters.
Advantages of using K-Means Clustering Machine Learning Algorithm
Applications:
K Means Clustering algorithm is used by most of the search engines like Yahoo, Google to cluster web pages by similarity and identify the ‘relevance rate’ of search results. This helps search engines reduce the computational time for the users.
The Linear Regression algorithm shows the relationship between 2 variables and how a change in one variable impacts the other. The algorithm shows the impact on the dependent variable of changing the independent variable. The independent variables are referred to as explanatory variables, as they explain the factors that impact the dependent variable; the dependent variable is often referred to as the factor of interest, and the independent variables as predictors.
Advantages of Linear Regression Machine Learning Algorithm
Applications of Linear Regression
The name of this algorithm could be a little confusing in the sense that Logistic Regression machine learning algorithm is for classification tasks and not regression problems. The name ‘Regression’ here implies that a linear model is fit into the feature space. This algorithm applies a logistic function to a linear combination of features to predict the outcome of a categorical dependent variable based on predictor variables.
The odds or probabilities that describe the outcome of a single trial are modelled as a function of explanatory variables. The logistic regression algorithm helps estimate the probability of falling into a specific level of the categorical dependent variable based on the given predictor variables.
Just suppose that you want to predict if there will be a snowfall tomorrow in New York. Here the outcome of the prediction is not a continuous number because there will either be snowfall or no snowfall and hence linear regression cannot be applied. Here the outcome variable is one of the several categories and using logistic regression helps.
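A minimal sketch of this snowfall example in scikit-learn (the temperatures and labels are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature: temperature in Celsius; label: 1 = snowfall, 0 = none.
temps = np.array([[-10], [-5], [-3], [-1], [2], [5], [8], [12]])
snow = np.array([1, 1, 1, 1, 0, 0, 0, 0])

clf = LogisticRegression().fit(temps, snow)
# predict gives the most likely category; predict_proba gives the
# modelled probabilities [P(no snow), P(snow)] for a given temperature.
print(clf.predict([[-4]]))
print(clf.predict_proba([[-4]]))
```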
Based on the nature of categorical response, logistic regression is classified into 3 types –
Random Forest is a trademark term for an ensemble of decision trees. In Random Forest, we’ve collection of decision trees (so known as “Forest”). To classify a new object based on attributes, each tree gives a classification and we say the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
Applications of Random Forest Machine Learning Algorithms
The Apriori algorithm is an unsupervised machine learning algorithm that generates association rules from a given data set. An association rule implies that if an item A occurs, then item B also occurs with a certain probability. Most of the association rules generated are in the IF-THEN format. For example, IF people buy an iPad THEN they also buy an iPad case to protect it. To derive such conclusions, the algorithm first observes the number of people who bought an iPad case while purchasing an iPad. This way a ratio is derived, e.g. out of 100 people who purchased an iPad, 85 also purchased an iPad case.
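The iPad example can be checked with a few lines of plain Python over hypothetical transactions. Support is the fraction of transactions containing an item set, and the confidence of the rule IF iPad THEN case is the ratio of the two supports:

```python
# Hypothetical transactions: which items each customer bought.
transactions = [
    {"ipad", "case"}, {"ipad", "case"}, {"ipad"},
    {"ipad", "case"}, {"case"}, {"ipad", "case"},
]

n = len(transactions)
support_ipad = sum("ipad" in t for t in transactions) / n          # 5/6
support_both = sum({"ipad", "case"} <= t for t in transactions) / n  # 4/6

# Confidence of the rule IF ipad THEN case:
confidence = support_both / support_ipad  # (4/6) / (5/6) = 0.8
print(confidence)
```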
Basic principle on which Apriori Machine Learning Algorithm works:
Applications:
Google auto-complete – when the user types a word, the search engine looks for other associated words that people usually type after a specific word.
An artificial neural network (ANN) learning algorithm, usually called “neural network” (NN), is a learning algorithm that is inspired by the structure and functional aspects of biological neural networks. Computations are structured in terms of an interconnected group of artificial neurons, processing information using a connectionist approach to computation. Modern neural networks are non-linear statistical data modelling tools. They are usually used to model complex relationships between inputs and outputs, to find patterns in data, or to capture the statistical structure in an unknown joint probability distribution between observed variables.
Applications:
It can be used for both classification and regression problems, though it is more widely used for classification in industry. K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of their k neighbors. The case is assigned to the class most common amongst its K nearest neighbors, measured by a distance function.
These distance functions can be Euclidean, Manhattan, Minkowski or Hamming distance. The first three are used for continuous variables and the fourth (Hamming) for categorical variables. If K = 1, the case is simply assigned to the class of its nearest neighbor. At times, choosing K turns out to be a challenge when performing KNN modeling.
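A minimal sketch with scikit-learn's KNeighborsClassifier on hypothetical toy points; metric="euclidean" is one of the distance functions mentioned above ("manhattan" and "minkowski" are accepted as well):

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two compact groups of points.
X = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
y = [0, 0, 0, 1, 1, 1]

# Each new case is classified by a majority vote of its 3 nearest neighbors.
clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
clf.fit(X, y)
print(clf.predict([[1, 1], [5, 4]]))
```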
Things to consider before selecting KNN:
Applications:
It is a Supervised Classification algorithm used to analyze data. Vladimir Vapnik invented the support vector machine.
At first approximation, what Support Vector Machines do is find a separating line, or more generally called a hyperplane, between data of two classes. So, suppose we have some data of two different classes. Support Vector Machine is an algorithm that takes this data as an input, and outputs a line that separates those classes in the best way, if possible.
The margin is the distance between the line and the nearest point of either of the two classes. The hyperplane made by SVM maximizes the margin.
Now to understand the correct result of a support vector machine look at the example below:
Line A does maximize the margin, in some sense, to all the data points, but it makes a classification error: the red x is on the wrong side of the green line. In line B, all the points are classified correctly.
Support vector machines put the correct classification of the labels first and foremost, and then maximize the margin. So for support vector machines, you try to classify correctly, and subject to that constraint, you maximize the margin.
Even if there are points that can't be classified correctly while retaining the largest margin, SVM will treat them as outliers and can safely ignore them.
Support Vector Machine Algorithm:
SVM just like Naive Bayes is a classification algorithm for binary (two-class) and multi-class classification problems. The technique is easiest to understand when described using binary or categorical input values.
Example code:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

clf = SVC(kernel="linear")
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
acc = accuracy_score(pred, labels_test)
On applying this algorithm on different data sets we get decision boundary that is classifying points, predict which data belongs to which group and find out the accuracy of our prediction algorithm using sklearn.
If we apply this algorithm in the self driving car problem which was discussed in the previous post link, we get our decision boundary like this:
SVM can produce some really complicated shapes for the decision boundary, sometimes even more complicated than you want.
If the data is not linearly separable, we can make it separable by adding a new feature, for example a polynomial feature such as z = x² + y². In the new feature space the same points become linearly separable, and that linear boundary corresponds to a non-linear boundary in the original x, y space.
There are functions that take a low-dimensional input space, or feature space, and map it to a very high-dimensional space, so that what used to be not linearly separable turns into a separable problem. These functions are called kernels. They aren't just functions of a feature space; they are functions over two inputs. When you apply the kernel trick, you change your input space from x, y to a much larger input space, separate the data points using a support vector machine, and then take the solution back to the original space: you now have a non-linear separation. A kernel can be linear, poly, rbf, sigmoid, precomputed, or a callable.
C parameter controls the tradeoff between a smooth decision boundary and one that classifies all the training points correctly. A large value of C means that you’re going to get more training points correct. So what that means in practice is that you get the more intricate decision boundaries with the larger values of C where it can wiggle around individual data points to try to get everything correct.
Gamma defines how far the influence of a single training example reaches. If we have a high value of gamma, the exact details of the decision boundary depend only on the closest points, effectively ignoring the faraway points. If we have a low value of gamma, even the faraway points are taken into account when placing the boundary.
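A small sketch of the trade-off on hypothetical toy data, with one class-1 point deliberately planted inside the class-0 region: a large C with large gamma contorts the boundary to classify every training point, while a small C with small gamma tends to settle for a smoother boundary that tolerates misclassifying the outlier.

```python
from sklearn.svm import SVC

# Hypothetical toy data; the last point is a class-1 outlier inside class 0.
X = [[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4], [0.4, 0.4]]
y = [0, 0, 0, 1, 1, 1, 1]

# Large C, large gamma: an intricate boundary that wiggles around the outlier.
wiggly = SVC(kernel="rbf", C=1000, gamma=10).fit(X, y)
# Small C, small gamma: a smooth boundary shaped by the bulk of the data.
smooth = SVC(kernel="rbf", C=0.1, gamma=0.1).fit(X, y)

# Compare training accuracy: the wiggly model fits every training point.
print(wiggly.score(X, y), smooth.score(X, y))
```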
Overfitting is a common phenomenon in machine learning that happens when you take your data too literally, and your machine learning algorithm produces something much more complex than the simple pattern actually present.
So, in machine learning we really want to avoid overfitting. One of the ways you can control overfitting is through the parameters of your algorithm.
Supervised learning is where you have input variables (x) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output.
Y = f(X)
The goal is to approximate the mapping function so well that when you have new input data (x) that you can predict the output variables (Y) for that data.
It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process. We know the correct answers, the algorithm iteratively makes predictions on the training data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable level of performance.
Supervised learning problems can be further grouped into regression and classification problems.
Classification: A classification problem is when the output variable is a category, such as “red” or “blue” or “disease” and “no disease”.
One of the uses of supervised classification is self driving cars. The cars need to be trained when to go fast and when to slow down based on the road terrain. Go fast when the road is smooth and slow down when it’s bumpy.
As an example, we take 750 points in our scatter plot for a self driving car.
What our machine learning algorithms do is define what's called a decision surface. The goal is to draw a decision boundary that helps us distinguish the terrain where we need to go slow from the terrain where we can go really fast. That means being able to draw a boundary that divides the two classes.
So we have our decision boundary that we can draw between our two classes, and for any arbitrary point, we can immediately classify it as terrain where we have to go slow or terrain where we can drive really fast.
So to make this decision boundary we use the Gaussian Naive Bayes algorithm, implemented in the scikit-learn (sklearn) Python library.
Gaussian Naive Bayes Algorithm:
Naive Bayes is a classification algorithm for binary (two-class) and multi-class classification problems. The technique is easiest to understand when described using binary or categorical input values.
This algorithm is divided into two phases, Training and Testing.
Example code:
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> Y = np.array([1, 1, 1, 2, 2, 2])
>>> from sklearn.naive_bayes import GaussianNB
>>> clf = GaussianNB()
>>> clf.fit(X, Y)
GaussianNB(priors=None)
>>> print(clf.predict([[-0.8, -1]]))
[1]
>>> clf_pf = GaussianNB()
>>> clf_pf.partial_fit(X, Y, np.unique(Y))
GaussianNB(priors=None)
>>> print(clf_pf.predict([[-0.8, -1]]))
[1]
On applying this algorithm on different data sets of self driving car example, we get decision boundary that is classifying points like this
We find out how well our algorithm is doing by writing code that tells us the accuracy of the naive Bayes classifier we made. Accuracy is just the number of points that are classified correctly divided by the total number of points in the test set.
Example code:
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(y_true, y_pred)
This way we can predict which data belongs to which group and find out the accuracy of our prediction algorithm using sklearn.
Bayes Rule:
Bayes’ Theorem provides a way that we can calculate the probability of a hypothesis given our prior knowledge.
Now for example, we are given a set of emails written by 2 authors. Then if we are given a new email, using Naive Bayes algorithm we can actually predict the author of the new email.
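A minimal sketch of such an author classifier. The emails are made up, and MultinomialNB is used here rather than the Gaussian variant because it is better suited to word counts:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training emails from two authors.
emails = [
    "meeting tomorrow about the budget forecast",
    "please review the budget numbers before the meeting",
    "great game last night the team played well",
    "did you watch the match the team was brilliant",
]
authors = ["alice", "alice", "bob", "bob"]

# Turn each email into word counts, then fit Naive Bayes on them.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
clf = MultinomialNB()
clf.fit(X, authors)

# Predict the author of a new, unseen email.
new_email = vectorizer.transform(["the budget meeting is moved"])
print(clf.predict(new_email))
```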
Strengths of Naive Bayes:
It’s actually really easy to implement with big feature spaces; there are between 20,000 and 200,000 words in the English language.
And it, it’s really simple to run, it’s really efficient.
Weaknesses of Naive Bayes:
It can break.
Historically, when Google first came out, when people searched for Chicago Bulls, a sports team whose name is made up of two words, it would show many images of bulls (the animals) and of cities like Chicago. But Chicago Bulls is something distinctly different.
So phrases that encompass multiple words and have distinctive meanings don’t work really well in Naïve Bayes.