
Learn Machine Learning Using Python In Data Science - (Part -4)

 Part - 4 :- Students, now let's add a new tool to our preprocessing toolkit: taking care of missing data. If we look again at our data set, Data.csv, we notice that there is a missing salary for one specific customer, a 40-year-old from Germany who purchased a product.

Download Data.csv :- Download

So generally you don't want to have any missing data in your data set, for the simple reason that it can cause errors when training your machine learning model, and therefore you must handle it.

A first way is to just ignore the observation by deleting it. That's one method, and it actually works if you have a large data set: if only one percent of the data is missing, removing that one percent of the observations won't change the learning quality of your model much.
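Here is a minimal sketch of that first approach, deleting the observations that contain a missing value. The array below is made up for illustration (it is not Data.csv), and it only holds the numerical columns.

```python
import numpy as np

# Made-up numerical columns (age, salary); np.nan marks a missing entry.
data = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [40.0, np.nan],   # this observation has a missing salary
    [38.0, 61000.0],
])

# True for every row in which no value is missing.
complete_rows = ~np.isnan(data).any(axis=1)

# Keep only the complete rows, i.e. delete the incomplete observation.
cleaned = data[complete_rows]

print(cleaned)
```

This is fine when very few rows are dropped, but every deleted row is an observation the model never gets to learn from.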

So one percent is fine, but sometimes you're going to have a lot of missing data, and then you must handle it the right way. So the first way was to ignore the missing values by removing them. The second way, which is what we're adding to the toolkit right now, is to replace the missing value by the average of all the values in the column in which the data is missing. Here we have a missing salary, so what we want to do is replace it by the average of all the other salaries. This is a classic way of handling missing data.
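To make the idea concrete before we bring in any library class, here is a small sketch of replacing a missing value by the column average with plain NumPy. The salaries are made-up values, not the ones from Data.csv.

```python
import numpy as np

# Made-up salary column with one missing value.
salaries = np.array([72000.0, 48000.0, np.nan, 61000.0])

# np.nanmean computes the average while ignoring the missing entries.
mean_salary = np.nanmean(salaries)

# Replace the missing entries by that average and keep the other values.
filled = np.where(np.isnan(salaries), mean_salary, salaries)

print(mean_salary)
print(filled)
```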

So here we go, taking care of missing data. Let's create a new code cell and replace that missing salary by the average of all the salaries.

So to do this, we're going to use one of the best data science libraries: scikit-learn. Scikit-learn is an amazing data science library containing a lot of tools, including a lot of data preprocessing tools.

We're going to first import the SimpleImputer class. Then we will create an instance, you know, an object of the SimpleImputer class. This object will allow us to replace the missing salary by the average of the salaries. And then we will have an updated data set, or more precisely an updated matrix of features, because we will apply this imputer to the matrix of features only. So we'll have a new matrix of features with no missing data, because the missing salary will have been replaced by the average salary.

Well, we're going to start with from, followed by scikit-learn, which has the name sklearn. Then remember, in order to access a module, we have to add a dot, because the SimpleImputer class we want to import belongs to a certain module of scikit-learn called impute. And from this impute module we're going to import the SimpleImputer class.

Importing scikit learn Module:- 

from sklearn.impute import SimpleImputer

So since we're about to create a new object, we have to introduce a new variable, and we're going to call this variable imputer, which will be exactly this object of the SimpleImputer class.

Now you're going to enter the right arguments in order to replace this missing salary by the average of the salaries, because note that there are actually several replacements you could do. Instead of replacing it by the average salary, you could replace it by the median salary. You know, there is a difference between the average and the median. You could also replace a missing value by the most frequent value.

First, we have to specify which missing values we have to replace. That's why the first argument we enter is called missing_values, which has to be equal to np.nan, that is, nan from the NumPy library. That's just to say that we want to replace all the missing values in the data set like this one, which is an empty value. Then the second argument says that the missing values, you know, the empty values of the data set, will be replaced by the mean. To do this, we add the next argument, strategy, and set it equal to 'mean' in quotes. That's just to say that we want to replace each missing value in the matrix of features by the mean of its own feature, that is, of its own column.

# np refers to NumPy (import numpy as np)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
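As a side note on the other replacements mentioned earlier, here is a small sketch comparing the 'mean', 'median' and 'most_frequent' strategies of SimpleImputer on one made-up column. The values are chosen only so that the strategies give visibly different results, and `fit_transform` is used here simply as a shortcut for fitting and transforming in one call.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single column with one missing value in the last row.
column = np.array([[1.0], [2.0], [2.0], [7.0], [np.nan]])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled = imputer.fit_transform(column)
    # Record the value that replaced the missing entry.
    results[strategy] = float(filled[-1, 0])

print(results)
```

The mean is pulled up by the large value 7, while the median and the most frequent value are not, which is why the choice of strategy can matter when a column has outliers.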

OK, we're almost there. Remember, this is just an object. We haven't connected anything yet to our matrix of features. So the next step is to apply this imputer object to the matrix of features. So how are we going to do that?

Well, remember that a class contains an assembly of instructions, but also some operations and actions which you can apply to other objects or variables. These are called methods. You know, they're like functions. One of them is the fit method. The fit method will connect this imputer to the matrix of features. In other words, the fit method will look at the values in, you know, the salary column, and it will compute the average of the salaries.

Let's first call the fit method in order to do this. Of course, we first take our object imputer, and then from this object, adding a dot, we call the fit method, which has parentheses because it's like a function inside the class.
And what does this function expect as arguments? It simply expects all the columns of X with numerical values, but only the ones with numerical values, not the ones with text, strings or categories.

Well, how do we get these columns? First, let's take the matrix of features X, because that's where we want to replace the missing data. And from this matrix of features X, we're first going to take all the rows.

You know, this fit method will read every column that we specify inside it in full. For the columns, we could specify all the columns in which to look for missing data. However, the first column is a problem: you know, it is a column of strings, and therefore it might cause a warning or an error when looking for missing data there.

Therefore, we are only going to specify the columns with numbers only, age and salary. So here we are going to enter the range from one to, be careful, not two, because remember that the upper bound of a range in Python is excluded. If we stopped at two, this would exclude the salary. Therefore we have to go up to three, so that this fit method will look for all the missing values in the age column and the salary column. So here we're specifying specific columns, the age column and the salary column. That's because we know that there is a missing salary. And by the way, there is also a missing age.
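Here is a small sketch of why the range has to go up to three. The tiny matrix below is made up, with the same column order as in this part (country, age, salary); the purchased column is left out because it is not part of the matrix of features.

```python
import numpy as np

# Made-up matrix of features: country (string), age, salary.
x = np.array([
    ["France", 44.0, 72000.0],
    ["Germany", 40.0, np.nan],
], dtype=object)

# The upper bound of a slice is excluded, so 1:2 keeps only column 1 (age).
only_age = x[:, 1:2]

# 1:3 keeps columns 1 and 2 (age and salary) and still skips column 0.
age_and_salary = x[:, 1:3]

print(only_age.shape)        # (2, 1)
print(age_and_salary.shape)  # (2, 2)
```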

However, as a general rule I recommend selecting all the numerical columns, because in your career you will work with huge data sets and you won't be able to see where the missing values are. So just include all the numerical columns to make sure to replace any missing data. Remember to exclude the string columns.

imputer.fit(x[:, 1:3])
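If you want to see what the fit method actually computed, here is a small sketch. After fitting, a SimpleImputer stores one value per column in its `statistics_` attribute. The array below is made up and only contains the numerical columns (age, salary).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numerical columns (age, salary), each with one missing value.
numeric = np.array([
    [30.0, 50000.0],
    [40.0, np.nan],
    [np.nan, 70000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(numeric)

# One mean per column, computed while ignoring the missing entries.
print(imputer.statistics_)
```

Note that fit only computes and stores these means. Nothing in the array has been replaced yet.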

Then here we go. This will connect our imputer to our matrix of features. And now the final step.
We have to call the transform method, once again from the imputer object. This transform method will do the replacement of the missing salary by the mean of the salaries, and the same for the missing age, which will be replaced by the mean of all the ages in the column. So, according to you, what do we have to input here? There is no trap. We of course have to input the columns of X where we want to replace missing data, which are the age column and the salary column. Therefore we simply have to input exactly the same as what was input in the fit method.

So we just take this, copy it and paste it inside the transform method. However, be careful: this transform method returns the new, updated version of those columns of the matrix of features X, with the two replacements of the missing salary and the missing age.
And therefore what we want to do now, and that's the last thing we have to do, is to update our matrix of features X. Since transform returns these two columns with the replacement done, what we do to update X is take the second and third columns of the matrix of features X and replace them with what is returned by the transform method of the imputer object. That way the second and third columns of X will contain the average age and average salary in place of the missing values, and the whole matrix of features X will be exactly the same as before, but with this new average age and average salary.

x[:,1:3] = imputer.transform(x[:,1:3])
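Putting the steps of this part together, here is a minimal end-to-end sketch. The matrix of features is made up (it is not Data.csv) and is built by hand as an object array with the columns country, age and salary, which is an assumption about how x was created in the previous part.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up matrix of features: country, age, salary.
x = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 40.0, np.nan],   # missing salary
    ["Spain", np.nan, 61000.0],  # missing age
], dtype=object)

# Create the imputer and let it compute the means of the numerical columns.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(x[:, 1:3])

# transform returns the two columns with the missing values replaced,
# so we write them back into the same columns of x.
x[:, 1:3] = imputer.transform(x[:, 1:3])

print(x)
```

The two calls can also be combined as `x[:, 1:3] = imputer.fit_transform(x[:, 1:3])`, which fits and transforms the same columns in one step.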

Now, to check this, we're going to create a new code cell where we print the new matrix of features X, and let's see if that missing value, which showed up as nan, is indeed replaced in this new version of X. So let's not forget to run these two cells: the first cell to replace the missing data, and then this cell to print the new matrix X. And there you go. As we can clearly see, the missing salary in the previous matrix of features X was replaced by the average salary of this column. Indeed, you can check in Google Sheets or Excel that 63,777 corresponds to the average of these values.

[Screenshot: the input cells and the printed output of X]


Now you have another tool in your data preprocessing toolkit. I'm sure you will have to use it several times when processing your future data sets to build machine learning models.

So congratulations !

  • And now we're going to proceed to a new tool, which is to encode categorical data. So we'll do that in the next part (Part-5).
                                              Thank You !
