Part - 3 :- Students, now let's learn together how to import a dataset. As a reminder, we're going to import the dataset Data.csv, which is a very simple dataset of, let's say, a retail company that is doing some analysis on which clients purchased one of their products.
So the rows in this dataset correspond to different customers of this company. And for each of these customers, we have the country they live in, their age, their salary and whether or not they purchased the product. OK, so we're going to learn how to import that CSV in Python, using, of course, the pandas library.
Importing The Dataset :-
So let's first create a new code cell and now let's import this dataset. So the first thing we have to do is to create a new variable and this variable will contain exactly the dataset.
Since now we're importing the data set and we want to integrate the data set in a variable, I'm going to call this variable dataset.
Well, it will be equal to the output of a certain function from pandas, and this function will read all the values of this dataset and create what we call a data frame. It's a certain format of data in Python, so it will create a data frame that contains exactly the same rows, columns and values as the CSV file. And that data frame will be exactly this dataset variable.
So since we're about to call a function of pandas, well, the first thing we have to do is call the pandas library. And therefore, remember, since we gave it the shortcut name pd, in order to call it we need to write pd here, and then to call a function from a library, we need to add a dot.
And that's where you can call the function you want to use. As we said, this function is named read_csv, and then you add some parentheses to enter the argument. So there we go, let's do this. The only thing you have to do when using this read_csv function is to input, in quotes, the name of the dataset file. As a reminder, the name of the file is Data.csv, with a capital D. So there we go, 'Data.csv', OK, and this will create the data frame with all the values inside this dataset, and this data frame will be exactly this dataset variable.
dataset = pd.read_csv('Data.csv')
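If you want to try read_csv before you have the file in place, here is a minimal sketch. It assumes the column names Country, Age, Salary and Purchased (they are not confirmed in this part), and it reads from an in-memory string instead of Data.csv, using only the first three customers shown in the output of this part.

```python
import io
import pandas as pd

# Stand-in for Data.csv so the sketch runs without the file.
# The column names are an assumption; the three rows come from
# the output shown at the end of this part.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
"""

# read_csv accepts a file path or a file-like object such as StringIO.
dataset = pd.read_csv(io.StringIO(csv_text))

print(type(dataset))   # a pandas DataFrame
print(dataset.shape)   # (number of rows, number of columns)
```

With the real file, the only change is to pass 'Data.csv' instead of the StringIO object.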
So that's the first step. But that's not enough to import a dataset, you know, as a first step of data preprocessing.
The next thing that you have to do is create two new entities. The first one is the matrix of features and the second one is the dependent variable vector.
What we want to create, you know, the two entities we want to create are, first, the matrix of features containing these three columns here: country, age and salary. And separately, we want to create the dependent variable vector containing only the last column, because that's the column we want to predict. That's exactly what we always have to do in the data preprocessing phase.
Let's create these two entities and we are going to call them X for the matrix of features and Y for the dependent variable vector.
We have our dataset, you know, containing all these columns. And in order to create X, well, we simply have to take the first three columns of this dataset, because X will be exactly all the values in these first three columns. And so what we're simply going to do is play with the indexes to collect the indexes of these first three columns, basically all the columns of the dataset except the last one.
What you're going to do is take your dataset, that exact same variable which you created in the first line of code, and then add a dot, because we're about to use one of the attributes of a pandas data frame. And that attribute is iloc[]. Here iloc stands for integer location, and it is used with square brackets rather than parentheses. It takes the indexes of the columns we want to extract from the dataset, and not only the indexes of the columns, but also the indexes of the rows. And actually, we have to start here with the rows. We can specify the rows that we want to get and put into X, and we want to get all the rows into X.
We only want to take the first columns, but we want to keep all the rows, and the trick to take all the rows, whatever dataset you have with whatever number of rows, is to add here a colon.
Why is that?
Because a colon in Python means a range. And when we specify a range with neither the lower bound nor the upper bound, that means in Python that we're taking everything in the range. Therefore here, all the rows.
So now we're going to use a trick so that we can automatically take, regardless of the number of columns in your dataset, all the columns except the last one, because all the columns except the last one are exactly the matrix of features. And the trick to do that is to add a new range here, which this time will be colon minus one [: , :-1].
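To see this rule on something simpler than a data frame, here is a small sketch on a plain Python list. The list and its contents are made up for illustration.

```python
# A made-up list standing in for the rows of a dataset.
rows = ['row0', 'row1', 'row2', 'row3']

everything = rows[:]    # no lower bound, no upper bound: take it all
first_two = rows[:2]    # no lower bound: start at index 0
from_one = rows[1:]     # no upper bound: go to the end

print(everything)
print(first_two)
print(from_one)
```

The same colon, used inside iloc[], selects all the rows of the data frame.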
So what does it mean?
Well, as we said, the colon here means a range. On the left of the colon we have nothing, which means that we're starting at the first index, you know, index zero, because indexes in Python start at zero. And then we're going up to minus one. So what does this minus one mean?
Well, minus one in Python means the index of the last column. However, and that's a very important principle in Python which you must absolutely know, a range in Python includes the lower bound, here index zero, but excludes the upper bound. And therefore, here we're excluding this index minus one, meaning the index of the last column.
Now, you just collected the right indexes to create a matrix of features X. And the beauty of this is that you won't have anything to change when creating the matrices of features X of your future datasets. But make sure that your future datasets indeed have the features in the first columns and the dependent variable in the last column.
So in order to finish this line of code, we just add here .values, and this just means that we're taking all the values in all the rows of this dataset and in all the columns except the last one.
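Here is a short sketch of these two rules, the negative index and the excluded upper bound, on a plain Python list. The column names in the list are assumed for illustration.

```python
# Assumed column names, used only to illustrate the indexing rules.
columns = ['Country', 'Age', 'Salary', 'Purchased']

last = columns[-1]            # -1 is the index of the last element
all_but_last = columns[:-1]   # upper bound -1 is excluded from the range

print(last)           # Purchased
print(all_but_last)   # ['Country', 'Age', 'Salary']
```

So [:-1] starts at index zero and stops just before the last element, which is exactly what we want for the matrix of features.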
X = dataset.iloc[:, :-1].values
Don't worry if this feels a bit overwhelming at the beginning, I promise you that we will use this trick many, many times. So you will soon be so familiar with it that you'll master it like a pro.
And now let's do the same for our dependent variable vector. And, you know, this will be exactly the same, we'll just have to change one little thing.
Y = dataset.iloc[:, -1].values
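As a quick check of what this line produces, here is a minimal sketch on a tiny hand-built data frame. The column names are assumed, and the three rows are the first three customers from the output of this part. Note that values is an attribute of the data frame, so it is written without parentheses.

```python
import pandas as pd

# Tiny stand-in for the real dataset (column names are an assumption).
dataset = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany'],
    'Age': [44.0, 27.0, 30.0],
    'Salary': [72000.0, 48000.0, 54000.0],
    'Purchased': ['No', 'Yes', 'No'],
})

# All rows, all columns except the last one, as a NumPy array.
X = dataset.iloc[:, :-1].values

print(type(X))
print(X.shape)  # (3, 3)
```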
What do we have to change here in order to get the dependent variable vector, which is, most of the time in our datasets, indeed the last column? Well, this time, since we only want to get one column, we definitely don't want to get a range. And therefore I'm going to remove the range here. And then what are we left with? We're left with minus one. And as I've told you, minus one is exactly the index of the last column.
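The same kind of check works for the dependent variable vector. This is a sketch under the same assumptions as before: a tiny hand-built data frame with assumed column names.

```python
import pandas as pd

# Tiny stand-in for the real dataset (column names are an assumption).
dataset = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany'],
    'Age': [44.0, 27.0, 30.0],
    'Salary': [72000.0, 48000.0, 54000.0],
    'Purchased': ['No', 'Yes', 'No'],
})

# All rows, only the last column: a single index, not a range.
Y = dataset.iloc[:, -1].values

print(Y)
print(Y.shape)  # (3,)
```

Because -1 is a single index and not a range, the result is a one-dimensional vector rather than a matrix.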
That's exactly what we need to create this dependent variable vector and thus this line of code is done. Congratulations.
Now you know how to import a dataset, create a matrix of features and create a dependent variable vector. And the cherry on the cake is that any time you want to create these for your dataset, you won't have anything to change because this will automatically take all the first columns for the matrix of features and the last column for the dependent variable vector.
Now I'm going to show you that X and Y are indeed created correctly. And in order to do this, we're going to add a new code cell here inside which we're just going to print. So that's the famous print function, which allows you to print anything, whether it is a text or, you know, an array like X or a vector like Y. So we're going to first print X, and then I'm going to add a new code cell where we're going to print Y, and this is just to show you that X and Y are indeed created correctly with this code.
Time for the fun part. We're going to execute all the cells here because, you know, so far we've just written the implementations, but we have to run the cells in order to actually build all this. So let's first run this code cell importing the libraries.
Run Your Code :- Your code should look like this (Part 1 to Part 3)
All right, so it's imported. As you can see if I click here, this one here means it has been executed. Now, time to run the second one.
But before running this, we have to do something very important. Please download Data.csv and put it into your project folder.
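As a recap, the steps of this part can be sketched as below. The helper import_dataset is a hypothetical name used only to keep the sketch self-contained, and the demo data (with assumed column names) stands in for Data.csv so the code runs without the file. In your notebook you would simply pass 'Data.csv'.

```python
import io
import pandas as pd

def import_dataset(source):
    """Hypothetical helper: read a CSV and split it into X and Y."""
    dataset = pd.read_csv(source)
    X = dataset.iloc[:, :-1].values   # all columns except the last
    Y = dataset.iloc[:, -1].values    # only the last column
    return X, Y

# Stand-in for Data.csv; in the notebook use import_dataset('Data.csv').
demo_csv = io.StringIO(
    "Country,Age,Salary,Purchased\n"
    "France,44,72000,No\n"
    "Spain,27,48000,Yes\n"
    "Germany,30,54000,No\n"
)

X, Y = import_dataset(demo_csv)
print(X)
print(Y)
```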
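If you are not sure the file is in the right place, a small check like the following can help. This is only a sketch: load_dataset is a hypothetical helper, not something from pandas, and it assumes the file should sit in the current working directory.

```python
from pathlib import Path
import pandas as pd

def load_dataset(path='Data.csv'):
    """Hypothetical helper: fail with a clear message if the CSV is missing."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"{csv_path} not found in {Path.cwd()}; "
            "download it and put it into your project folder."
        )
    return pd.read_csv(csv_path)
```

Calling load_dataset() with no argument looks for Data.csv in the folder the notebook is running from.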
So now we're going to execute this cell in order to print the matrix of Features X, just to check that indeed we get all the first columns inside this matrix. And indeed, well, let's check the dataset once again.
Remember, the first columns, meaning the features we wanted to get into this matrix X, are first the country, second the age and third the salary. These are the three columns. And indeed, inside X we have first the country column with all the countries of these customers, then their age, and in the third column their salary, or their estimated salary. So that's perfect. We indeed get the matrix of features X containing all the features, also called the independent variables.
And now let's run the cell to print Y, the dependent variable vector. And indeed we get the dependent variable vector containing all the decisions, whether or not the customers purchased the product. Right. We can check. No. Yes. No, no, and so on.
So now you know, and therefore, congratulations: not only did you improve your knowledge of machine learning, but you also now know how to import a dataset and create a matrix of features and a dependent variable vector.
Output :-
[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
So now we're going to proceed to the next step, which is a new tool that I'm going to teach you: taking care of missing data. Let's do this in the next part (Part - 4). And until then, enjoy machine learning.