Harbola DataScience

Answer, Read and Post Coding related Questions!

Best DataScience library – Pandas for Machine Learning

What is Pandas in Python?

Pandas is an open source Python package that is widely used for data science, data analysis and machine learning tasks, which is what we'll be covering today. It is built on top of another package named NumPy, which provides support for multi-dimensional arrays. As one of the most popular data manipulation packages, pandas works well with many other data science libraries in the Python ecosystem, and it is usually easy to install (for example with pip install pandas).

In machine learning you work with data all the time, and pandas makes it very easy to create DataFrames.

# importing pandas library  
import pandas as pd  
  
# string values in the list   
lst = ['Java', 'Python', 'C', 'C++',  
         'JavaScript', 'Swift', 'Go']  
  
# Calling DataFrame constructor on list  
dframe = pd.DataFrame(lst)  
print(dframe) 

How are DataFrames used in Python for machine learning?

Pandas is easy to use and efficient: it simplifies many of the time-consuming, repetitive tasks associated with working with data, including:

  • Data cleansing – the data you get for building a model is often cluttered and contains lots of special characters, unnecessary spaces and other irregularities that you don't want to feed to an algorithm.
  • Data fill – you also get empty records that can harm your model's accuracy. You often can't afford to drop a whole row just because of one empty field, so you can use various data handling techniques to fill the empty spaces in the dataset (mean, median and mode are some basic techniques).
  • Data normalization – keeping values in consistent units and scales matters, because inconsistent values can hurt your model. As a basic example, if the model was trained on distances in kilometres, you need to feed it kilometres, not metres.
  • Merges and joins – high-accuracy models are often built from large datasets, so many times we need to join DataFrames.
  • Data visualization – a picture is worth a thousand words, and that is what a graph does here. It shows multiple relationships at once, which helps us shortlist algorithms and saves time.
  • Statistical analysis
  • Data inspection
  • Loading and saving data
  • And much more
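
Here is a small sketch of the first four tasks in the list above. The trips and states tables and their column names are made up for illustration, so treat this as one possible way to do it and swap in your own data.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
trips = pd.DataFrame({
    "city": [" Delhi ", "mumbai!!", "Pune", "Delhi"],
    "distance_m": [12000.0, None, 8000.0, 15000.0],
})

# Data cleansing: trim spaces, drop special characters, fix casing
trips["city"] = (
    trips["city"]
    .str.strip()
    .str.replace(r"[^A-Za-z ]", "", regex=True)
    .str.title()
)

# Data fill: replace the missing distance with the column median
trips["distance_m"] = trips["distance_m"].fillna(trips["distance_m"].median())

# Consistent units: convert metres to kilometres
trips["distance_km"] = trips["distance_m"] / 1000

# Merge: attach a second (also hypothetical) table on the shared "city" column
states = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Pune"],
    "state": ["Delhi", "Maharashtra", "Maharashtra"],
})
merged = trips.merge(states, on="city", how="left")
print(merged)
```

A left merge keeps every row of trips, even if a city has no match in states.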

Pandas is taught to beginners and is also used by many of the world's leading ML engineers as a favourite tool for data manipulation.

Important pandas functions for machine learning

There is no need to go through the full pandas documentation if you are just starting out. It will overwhelm you. Just learn the functions below and start building your projects.

read_csv()

Once you have imported pandas as pd, you can use the following command to load your dataset into your workspace (Jupyter Notebook or whatever IDE you are using).

data_1 = pd.read_csv(r'C:Userschitranshu_dataset.csv')
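
If you don't have a CSV file handy, you can still try read_csv() on an in-memory string. This is only a sketch: the csv_text content and its column names are invented for the example.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = """City,City temp,Humidity
Delhi,41.0,30
Mumbai,33.5,75
Pune,,40
"""

# read_csv accepts a file path or any file-like object
data_1 = pd.read_csv(io.StringIO(csv_text))
print(data_1.shape)
print(data_1.isna().sum())  # empty fields are read as NaN
```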

head()

To print the first few rows of your dataset, call head() on the DataFrame, for example data_1.head(). By default it returns the top 5 rows.
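
A quick sketch of how this looks, using a small made-up DataFrame named df:

```python
import pandas as pd

# Hypothetical 10-row DataFrame for illustration
df = pd.DataFrame({"n": range(10)})

first_five = df.head()    # default: first 5 rows
first_three = df.head(3)  # pass a number to change how many rows you get

print(first_five)
print(first_three)
```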

describe()

To get an overview of your dataset, call describe() on it, for example data_1.describe(). It generates descriptive statistics (count, mean, standard deviation, min, max and quartiles for numeric columns) of the data in a pandas DataFrame or Series.
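
For example, on a tiny made-up temperature column (the values are invented for this sketch):

```python
import pandas as pd

# Hypothetical temperature readings
temps = pd.DataFrame({"City temp": [41.0, 33.5, 38.0, 36.5]})

summary = temps.describe()
print(summary)

# Individual statistics can be looked up by row label
print(summary.loc["mean", "City temp"])
```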

loc[:]

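loc[] helps to access a group of rows and columns in a dataset, that is, a slice of the dataset, as per our requirement. It selects by label. For instance, if we only want the last 2 rows and the first 3 columns of a dataset, we can slice by row and column labels with loc[], or by integer position with its counterpart iloc[].

If you are familiar with Python string slicing, it will be easier for you to implement. One difference to keep in mind: a loc[] slice includes its end label, while iloc[] excludes the end position just like a string slice.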
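
As a sketch, here are two ways to pick the last 2 rows and the first 3 columns. The DataFrame and its labels are made up for the example; loc works with labels, and iloc is the position-based counterpart.

```python
import pandas as pd

# Hypothetical DataFrame with string row labels
df = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "c": [9, 10, 11, 12], "d": [13, 14, 15, 16]},
    index=["w", "x", "y", "z"],
)

# Label-based: rows "y" to "z" and columns "a" to "c" (both ends included)
by_label = df.loc["y":"z", "a":"c"]

# Position-based: last 2 rows, first 3 columns (end excluded, like list slicing)
by_position = df.iloc[-2:, :3]

print(by_label)
print(by_position)
```

Both selections return the same 2 x 3 slice here.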

groupby()

groupby() is used to group a pandas DataFrame by one or more columns and perform some mathematical operation (such as a sum or mean) on each group. groupby() can be used to summarize data in a simple manner.
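
In pandas the method is spelled groupby(). A minimal sketch with invented sales data:

```python
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({
    "city": ["Delhi", "Pune", "Delhi", "Pune"],
    "amount": [100, 50, 300, 150],
})

# Group rows by city, then take the mean of "amount" within each group
mean_by_city = sales.groupby("city")["amount"].mean()
print(mean_by_city)
```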

fillna()

On large datasets you will often see that many records contain NaN values. NaN stands for "not a number", and pandas uses it to mark missing data.

fillna() replaces the NaN values in a DataFrame or Series with more appropriate values.

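# assign the result back; calling fillna(..., inplace=True) on a selected column may not update data_1 in newer pandas versions
data_1['City temp'] = data_1['City temp'].fillna(38.5)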

The above code will replace all missing "City temp" entries with 38.5. Missing values can be imputed with the mean, median, mode, or some other value. In our case, 38.5 is the mean of the column.
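
Instead of typing the mean by hand, you can let pandas compute it. This is a sketch on a made-up column whose mean happens to be 38.5:

```python
import pandas as pd

# Hypothetical data with one missing temperature
data = pd.DataFrame({"City temp": [41.0, None, 36.0]})

# mean() skips NaN values, so this is (41.0 + 36.0) / 2
mean_temp = data["City temp"].mean()

# Assign the filled column back to the DataFrame
data["City temp"] = data["City temp"].fillna(mean_temp)
print(data)
```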

How Pandas is different from NumPy

In simple terms, pandas is used for working with data: loading data into Python, creating and deleting rows and columns, and displaying the top rows of DataFrames. NumPy, on the other hand, is used for numerical calculations.

Both are used in most machine learning projects.
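
A small sketch of how the two usually work together: pandas handles the labelled table, and NumPy takes over once you only need the numbers. The column names and values are made up.

```python
import numpy as np
import pandas as pd

# pandas: a labelled table that is easy to load, inspect and reshape
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0],
    "weight_kg": [50.0, 60.0, 70.0],
    "note": ["a", "b", "c"],
})
df = df.drop(columns=["note"])  # delete a column by name

# NumPy: a plain numeric array for calculations
arr = df.to_numpy()
col_means = np.mean(arr, axis=0)  # mean of each column

print(arr.shape)
print(col_means)
```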

Wrapping up

These are some of the most commonly used pandas functions in machine learning projects. There are also a few advanced ones, but once you understand how to use the functions above, a quick Google search or a look on Stack Overflow will get your work done.

Hope this answers your question: 'What are the most used pandas functions in every machine learning project?'
If you have more such doubts or questions, feel free to comment.
And if you are interested in learning machine learning, deep learning, data science and other Python-related topics, mail me at chitranshuharbola@gmail.com

Chitranshu Harbola

Self-taught programmer, web developer, aspiring machine learning engineer and data science student
