R & Python 101: Data Cleaning

However your team does data analysis, one universal truth applies: the insights you collect are only as good as the data that goes into finding them. Clean data sets are essential to the analytical process, so data teams spend a lot of time making sure their data is as good as possible before running an analysis. The problem is that the data-cleaning process is long and manual, and is often cited as taking 60 to 80% of a data scientist’s time.

With all that effort spent preparing the data for analysis, data scientists have little time left to actually perform the research and find insights in the data. Languages such as R and Python include packages that assist with data cleanup, giving data teams more bandwidth to perform analysis and better tools to dive deeper into the clean data sets.

Using R and Python to Clean Data Better

Scripting languages such as Python and R let data scientists clean data in bulk. For example, Python’s re library handles pattern-based string operations that are often slower to write and harder to read in SQL, which reduces the time and effort that goes into cleaning. Consider a data set with a lot of missing data. Built-in pandas functions such as fillna and dropna let data scientists treat all empty cells in a range the same way: those cells can be filled with the mean, the median or a specific value (fillna), or removed entirely (dropna). Other large-scale cleanup activities, such as removing duplicates, can also be handled with a single line of code rather than the more time-intensive process the same task usually requires in SQL.
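As a rough illustration of the functions mentioned above, here is a minimal sketch using pandas and re. The sample table, its column names (customer, phone, order_total) and the clean helper are hypothetical and chosen only for demonstration. The order of the steps is one reasonable choice, not a prescribed method.

```python
import re

import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
raw = pd.DataFrame({
    "customer": ["  Alice ", "BOB", "BOB", "carol", None],
    "phone": ["(555) 123-4567", "555.987.6543", "555.987.6543",
              "5551112222", "555-000-1111"],
    "order_total": [120.0, None, None, 80.0, 40.0],
})


def clean(df):
    """Return a cleaned copy of df (hypothetical helper for this sketch)."""
    out = df.copy()

    # Normalize names: trim whitespace and use consistent capitalization.
    out["customer"] = out["customer"].str.strip().str.title()

    # Use re to strip every non-digit character from the phone numbers.
    out["phone"] = out["phone"].map(lambda s: re.sub(r"\D", "", s))

    # Remove rows that are exact duplicates of an earlier row.
    out = out.drop_duplicates()

    # Drop rows that have no customer name at all.
    out = out.dropna(subset=["customer"])

    # Fill the remaining missing totals with the median of the known totals.
    median_total = out["order_total"].median()
    out = out.fillna({"order_total": median_total})

    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Here the median is computed after duplicates and incomplete rows are removed, so discarded rows do not influence the fill value. Whether to fill with the mean, the median or a fixed value depends on the data set.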

Cleanup steps like these are shorter and simpler in R and Python than the equivalent queries would be in SQL, which means fewer resources used and a lighter load on the system. When the entire data system runs more efficiently, it frees up resources to run more queries and get results faster, so the data team has more time to search the resulting data for insights. All of those efficiencies add up to a lot more room for the data team to analyze the data creatively and provide value to the company.

Cleaning Data in Periscope Data

Using R and Python to perform data cleanup in Periscope Data is simple. Just pull the data with SQL and then pass it into R or Python in the Periscope editor. From there, a data team can run efficient, scripted cleanup processes to prepare data for analysis in a fraction of the time. With clean data, teams can use the time they saved to explore deeper questions about the information and build more advanced charts to illustrate their findings.

To learn more about cleaning data with R and Python inside Periscope Data, download our guide.


Neha Kumar
Neha is passionate about creating powerful data visualizations, spreading awareness of analytics best practices and learning about the latest and greatest advancements in data science. She strives to leverage multiple programming languages for analyses that are more efficient and effective than current methodologies.