The pandas library is THE tool for data cleaning, data prep, and data analysis in Python. Once you find your way around its expansive API, it’s a joy to work with. 🎉
Pandas stores your data in memory, which makes operations zippy! 🏎 The downside is that a large dataset might not fit into your machine’s memory, grinding your work to a halt. ☹️
Often, it’s handy to know how much memory your pandas DataFrame occupies. I’ve been working with pandas and teaching it to students for several years. I even wrote a little book on getting started with it…
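As a quick illustration of checking a DataFrame's footprint, here is a minimal sketch using `DataFrame.memory_usage`. The toy data and column names are made up for this example and are not from the article.

```python
import pandas as pd

# Hypothetical example data, invented for illustration
df = pd.DataFrame({
    "id": range(1_000),
    "city": ["Springfield"] * 1_000,
})

# Bytes used by the index and each column.
# deep=True asks pandas to inspect object/string columns for their real size.
per_column = df.memory_usage(deep=True)
total_bytes = per_column.sum()

print(per_column)
print(f"Total: {total_bytes / 1024:.1f} KiB")
```

`df.info(memory_usage="deep")` prints a similar total alongside the column dtypes.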
Shape errors are the bane of many folks learning data science. I would bet money that people have quit their data science learning journey due to frustration with getting data into the shape required for machine learning algorithms.
Having a stronger understanding of how to reshape your data will spare you tears, save you time, and help you grow as a data scientist. In this article, you’ll see how to get your data into the shape you need. 🎉
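One common case, sketched here with made-up numbers: scikit-learn estimators expect the feature array to be two-dimensional, with shape `(n_samples, n_features)`, so a single feature stored as a 1D array usually needs reshaping first.

```python
import numpy as np

# A single feature stored as a flat, 1D array (illustrative values)
x = np.array([1.0, 2.0, 3.0, 4.0])
print(x.shape)  # (4,)

# -1 tells NumPy to infer that dimension: 4 rows, 1 column
X = x.reshape(-1, 1)
print(X.shape)  # (4, 1)

# ravel() flattens back to 1D when a flat array is needed, e.g. for a target
flat = X.ravel()
print(flat.shape)  # (4,)
```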
Scikit-learn version 0.24.0 is packed with new features for machine learning. It arrived just in time for the New Year. Let’s look at the highlights! ☃️
The new classes choose the best hyper-parameters using a tournament approach.
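The teaser doesn't name the classes, but this presumably refers to the successive-halving searchers added in 0.24, `HalvingGridSearchCV` and `HalvingRandomSearchCV`. Here is a minimal sketch with the grid version. The dataset, estimator, and parameter grid are assumptions chosen only to keep the example small.

```python
# The halving searchers are experimental, so this import is required to enable them
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, invented for illustration
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Hypothetical grid: 9 candidate combinations
param_grid = {
    "max_depth": [2, 4, None],
    "min_samples_split": [2, 5, 10],
}

# Each round keeps roughly the top 1/factor of candidates
# and gives the survivors more samples to train on
search = HalvingGridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid,
    factor=3,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(search.n_iterations_)
```

Weak candidates are dropped early on small samples, so the full data is spent only on the finalists.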
NumPy is a library that every data scientist who uses Python should be familiar with. It is the backbone on which the modern Python data science stack is built.
The library is often picked up in pieces along your learning journey. Eventually, it makes sense to learn the key parts of the library systematically. As a first step, you need to know how to quickly create NumPy arrays to meet your needs. In this article I’ll show you the functions and methods to make NumPy arrays in a snap. 😀
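To give a flavor, here are a few standard array constructors in a short sketch. The selection is my own and may not match the article's full list.

```python
import numpy as np

zeros = np.zeros((2, 3))            # 2x3 array of 0.0
ones = np.ones(4)                   # 1D array of four 1.0 values
steps = np.arange(0, 10, 2)         # start, stop (exclusive), step
grid = np.linspace(0.0, 1.0, 5)     # 5 evenly spaced points, both ends included
filled = np.full((2, 2), 7)         # 2x2 array filled with 7
from_list = np.array([[1, 2], [3, 4]])  # build from nested Python lists

# Random values from a seeded generator, so results are reproducible
rng = np.random.default_rng(seed=42)
noise = rng.normal(size=(3, 3))

print(steps)
print(grid)
```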
Note: I originally published this article for Deepnote here. You can…
Without further ado, here are the best places to find data, with some helpful information about each. Folks keep pointing me to new sources, so the list is expanding! If you have a favorite, please send it my way! 😀
Awesome Data is a GitHub repository with a seriously impressive list of datasets separated by category. It is updated regularly.
Python is the most popular language for data scientists. 🐍 Conda is the most common tool to create a virtual environment and manage packages for data scientists using Python.
Unfortunately, figuring out the best way to get conda on your machine and when to install packages from various channels isn’t straightforward. And it’s not easy to find the most useful commands for using conda and pip all in one place. ☹️
In this article I’m going to provide the essential conda commands and suggestions to help you avoid headaches with installation and use. 🎉
Let’s get to it! 🚀
JupyterLab is awesome. It has almost everything a data scientist could want.
I’ve found it to be missing just one thing — a list of keyboard shortcuts. ☹️
With keyboard shortcuts, you can whiz around Jupyter notebooks in JupyterLab. You can save time, reduce wrist fatigue from using your mouse, and impress your friends. 🙂
Dealing with big data can be tricky. No one likes out-of-memory errors. ☹️ No one likes waiting for code to run. ⏳ No one likes leaving Python. 🐍
Don’t despair! In this article I’ll provide tips and introduce up-and-coming libraries to help you efficiently deal with big data. I’ll also point you toward solutions for code that won’t fit into memory. And all while staying in Python. 👍
Using pandas with Python allows…
“What is data science?” is a simple question, but the answers are often confusing. I regularly hear folks say that data science is nothing more than statistics dressed up in fancy clothes. Data science has jokingly been called statistics on a Mac. And a data scientist has been called a data analyst who lives in California. 😂
While these statements are humorous, it’s not at all obvious what data science encompasses. There have been many data science Venn diagrams and many definitions over the years. …
If you’re like me, you might have used R-Squared (R²), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) evaluation metrics in your regression problems without giving them a lot of thought. 🤔
Although all of them are common metrics, it’s not obvious which one to use when. After writing this article I have a new favorite and a new plan for reporting them going forward. 😀
I’ll share those conclusions with you in a bit. First, we’ll dig into each metric. You’ll learn the pros and cons of each for model selection and reporting. Let’s get to it…
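For reference, here is a minimal sketch of the three metrics computed from their standard definitions with NumPy. The true and predicted values are made up for illustration.

```python
import numpy as np

# Hypothetical targets and predictions, invented for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

residuals = y_true - y_pred

# MAE: average absolute error, in the same units as the target
mae = np.mean(np.abs(residuals))

# RMSE: square root of the mean squared error; penalizes large errors more
rmse = np.sqrt(np.mean(residuals ** 2))

# R²: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
# MAE=0.500  RMSE=0.612  R2=0.925
```

The same quantities are available in `sklearn.metrics` as `mean_absolute_error`, `mean_squared_error`, and `r2_score`.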