Pandas memory reports can mislead — learn how to see the real usage and save memory!

sleeping panda
Source: pixabay.com

The pandas library is THE tool for data cleaning, data prep, and data analysis in Python. Once you find your way around its expansive API, it’s a joy to work with. 🎉

Pandas stores your data in memory, which makes operations zippy! 🏎 The downside is that a large dataset might not fit into your machine’s memory, grinding your work to a halt. ☹️

Often, it’s handy to know how much memory your pandas DataFrame occupies. I’ve been working with pandas and teaching it to students for several years. I even wrote a little book on getting started with it…
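Here’s a minimal sketch of why the default report can mislead. The DataFrame, the column names, and the switch to the category dtype are illustrative assumptions, not taken from the article. By default, `memory_usage` counts only the pointers in an object column. Passing `deep=True` also counts the Python strings those pointers refer to.

```python
import pandas as pd

# Hypothetical example data: a repetitive text column and an integer column.
# dtype="object" is set explicitly so the text is stored as Python strings.
df = pd.DataFrame(
    {
        "city": pd.Series(["Athens", "Boston", "Chicago"] * 1000, dtype="object"),
        "visits": range(3000),
    }
)

# Shallow report: object columns are counted as pointers only.
shallow_bytes = df.memory_usage().sum()

# Deep report: also counts the strings the pointers refer to.
deep_bytes = df.memory_usage(deep=True).sum()

# One common way to shrink a column of repeated strings: the category dtype.
city_object_bytes = df["city"].memory_usage(deep=True)
city_category_bytes = df["city"].astype("category").memory_usage(deep=True)

print(f"shallow report: {shallow_bytes:,} bytes")
print(f"deep report:    {deep_bytes:,} bytes")
print(f"city as object:   {city_object_bytes:,} bytes")
print(f"city as category: {city_category_bytes:,} bytes")
```

The exact byte counts depend on your pandas version and platform, so treat the printed numbers as a comparison rather than fixed values.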


Shaping and reshaping NumPy and pandas objects to avoid errors

Shape errors are the bane of many folks learning data science. I would bet money that people have quit their data science learning journey due to frustration with getting data into the shape required for machine learning algorithms.

Having a stronger understanding of how to reshape your data will spare you tears, save you time, and help you grow as a data scientist. In this article, you’ll see how to get your data into the shape you need. 🎉
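As a small taste, here’s a sketch of the most common reshaping moves in NumPy. The array and variable names are made up for illustration. Libraries such as scikit-learn expect a two-dimensional feature array, which is where a one-dimensional array often causes a shape error.

```python
import numpy as np

# A one-dimensional array of six values: shape (6,)
values = np.arange(6)

# One feature column with six samples: shape (6, 1).
# The -1 tells NumPy to infer that dimension from the array's size.
column = values.reshape(-1, 1)

# One sample with six features: shape (1, 6)
row = values.reshape(1, -1)

# Back to one dimension: shape (6,)
flat = column.ravel()

print(values.shape, column.shape, row.shape, flat.shape)
```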

greece ruins
Athens has many rows and columns. Source: pixabay.com

Doing it

First, let’s make sure we’re using similar package versions. Let’s import the libraries we’ll need under their usual aliases. …


A cheat-sheet for the major changes

Scikit-learn version 0.24.0 is packed with new features for machine learning. It arrived just in time for the New Year. Let’s look at the highlights! ☃️

space shuttle flying
Source: pixabay.com

1. Faster ways to select hyper-parameters

HalvingGridSearchCV and HalvingRandomSearchCV join GridSearchCV and RandomizedSearchCV as less resource-intensive members of the hyper-parameter tuning family.

The new classes choose the best hyper-parameters using a tournament approach.
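Here’s a minimal sketch of the tournament idea with HalvingGridSearchCV. The toy dataset, the LogisticRegression estimator, and the parameter grid are assumptions chosen for illustration. In each round, roughly the top 1/factor of the candidates survive and are re-fit with more resources, which by default means more training samples. Note that the halving classes are experimental, so they need an explicit enabling import.

```python
# The halving search classes are experimental and must be enabled first.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import HalvingGridSearchCV

# Hypothetical toy classification data.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Four candidate values for the regularization strength.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = HalvingGridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    factor=3,        # keep about one third of the candidates each round
    random_state=0,  # makes the sample subsetting reproducible
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("rounds run:", search.n_iterations_)
```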


NumPy is a library that every data scientist who uses Python should be familiar with. It is the backbone on which the modern Python data science stack is built.

The library is often picked up in pieces along your learning journey. Eventually, it makes sense to learn the key parts of the library systematically. As a first step, you need to know how to quickly create NumPy arrays to meet your needs. In this article I’ll show you the functions and methods to make NumPy arrays in a snap. 😀
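To give a flavor, here’s a quick sketch of some common array constructors. Which functions the article itself covers is an assumption on my part; the ones below are all standard NumPy calls, and the variable names are just for illustration.

```python
import numpy as np

from_list = np.array([1, 2, 3])        # from a Python list
zeros = np.zeros((2, 3))               # 2 rows, 3 columns of 0.0
ones = np.ones(4)                      # four 1.0 values
filled = np.full((2, 2), 7)            # 2x2 array filled with 7
steps = np.arange(0, 10, 2)            # start, stop (exclusive), step
evenly = np.linspace(0.0, 1.0, 5)      # 5 evenly spaced values, endpoints included
identity = np.eye(3)                   # 3x3 identity matrix

# Random values in [0, 1) from a seeded generator for reproducibility.
rng = np.random.default_rng(seed=42)
random_values = rng.random((2, 2))

print(steps)
print(evenly)
```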

Note: I originally published this article for Deepnote here. You can…


I’m often asked for good places to find data. Here are my t̶e̶n̶ fifteen favorites.

poppies and sun
Poppies data? Source: pixabay.com

Without further ado, here are the best places to find data, with some helpful information about each. Folks keep pointing me to new sources, so the list is expanding! If you have a favorite, please send it my way! 😀

Awesome Data

Awesome Data is a GitHub repository with a seriously impressive list of datasets separated by category. It is updated regularly.

Data Is Plural

Jeremy Singer-Vine’s Data Is Plural weekly newsletter has great fresh data sources. I’m always impressed by the quality. The archive is available here.

Kaggle Datasets


Plus when to use Miniconda, Anaconda, conda-forge, and pip for a conda good time 😁

Python is the most popular language for data scientists. 🐍 Conda is the most common tool to create a virtual environment and manage packages for data scientists using Python.

Unfortunately, figuring out the best way to get conda on your machine and when to install packages from various channels isn’t straightforward. And it’s not easy to find the most useful commands for using conda and pip all in one place. ☹️

In this article I’m going to provide the essential conda commands and suggestions to help you avoid headaches with installation and use. 🎉

mt. st. helen’s volcano
Sometimes Python virtual environments and packages feel like a volcano. Source: pixabay.com

Let’s get to it! 🚀

The Need

Whether…


Common keyboard shortcuts for notebooks

JupyterLab is awesome. It has almost everything a data scientist could want.

  • Tabbed windows ✅
  • Split windows ✅
  • Jupyter notebooks ✅
  • File browser ✅
  • Markdown file previews ✅
  • Helpful extensions ✅
  • Widget capabilities ✅
  • Edit .csv files ✅
  • Terminal windows ✅
  • Python scripts ✅
  • Export notebooks in many formats ✅
  • Helpful tutorials ✅

I’ve found it to be missing just one thing — a list of keyboard shortcuts. ☹️

Source: pixabay.com

With keyboard shortcuts, you can whiz around Jupyter notebooks in JupyterLab. You can save time, reduce wrist fatigue from using your mouse, and impress your friends. 🙂

Below is the missing list…


Tips and libraries to speed up your Python code

Dealing with big data can be tricky. No one likes out-of-memory errors. ☹️ No one likes waiting for code to run. ⏳ No one likes leaving Python. 🐍

Don’t despair! In this article I’ll provide tips and introduce up-and-coming libraries to help you efficiently deal with big data. I’ll also point you toward solutions for code that won’t fit into memory. And all while staying in Python. 👍

Let’s get to the other side of the bridge! Source: pixabay.com

Python is the most popular language for scientific and numerical computing. Pandas is the most popular library for cleaning data and exploratory data analysis.

Using pandas with Python allows…


Each data science project has one of these three goals

“What is data science?” is a simple question, but the answers are often confusing. I regularly hear folks say that data science is nothing more than statistics dressed up in fancy clothes. Data science has jokingly been called statistics on a Mac. And a data scientist has been called a data analyst who lives in California. 😂

sunlight, tree, path
On the quest for understanding. Source: pixabay.com

While these statements are humorous, it’s not at all obvious what data science encompasses. There have been many data science Venn diagrams and many definitions over the years. …


R², RMSE, and MAE

If you’re like me, you might have used R-Squared (R²), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) evaluation metrics in your regression problems without giving them a lot of thought. 🤔

Although all of them are common metrics, it’s not obvious which one to use when. After writing this article I have a new favorite and a new plan for reporting them going forward. 😀

I’ll share those conclusions with you in a bit. First, we’ll dig into each metric. You’ll learn the pros and cons of each for model selection and reporting. Let’s get to it…
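For reference, here’s a minimal sketch of how the three metrics are computed, using only the standard library. The function name and the sample values are hypothetical and only there to show the arithmetic. It uses the standard definitions: MAE is the mean of the absolute errors, RMSE is the square root of the mean squared error, and R² is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², RMSE, and MAE for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # Mean Absolute Error: average size of the errors.
    mae = sum(abs(e) for e in errors) / n

    # Root Mean Squared Error: penalizes large errors more heavily.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    # R²: share of the variance in y_true explained by the predictions.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    return {"r2": r2, "rmse": rmse, "mae": mae}


# Made-up values for illustration.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

In practice you’d likely reach for a library implementation, but writing the formulas out makes it easier to see how each metric treats large errors.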

Jeff Hale

I write about data science. Join my Data Awesome mailing list to stay on top of the latest data tools and tips: https://dataawesome.com
