Advanced SQL

R & Python 101: Restructuring Datasets

Over the course of the analysis process, a dataset needs to go through several different transformations. Once the information has been collected and cleaned, the scientists and analysts need to be able to manipulate the information in multiple ways to best represent important insights. Often, this involves long in-depth analysis sessions and several iterations of research to look at a question from new angles.

When the data teams have found an important insight, they still need to be able to visualize that information clearly for relevant stakeholders. When the goal shifts from performing calculations for generating insights to displaying information in the most easy-to-understand way, the data will likely need to be manipulated again. In Python, we call this manipulation pivoting and melting. R can also perform this operation, using the spread and gather functions in the Tidyr library.

To illustrate these concepts, study the two images below. Pivoting (or spreading) is changing the data from the image on the left to look like the image on the right. Melting (or gathering) is the opposite, moving from the image on the right to the image on the left.

Using R and Python to Restructure Datasets

Unlike SQL, R and Python are languages that allow for both preparation and visualization of data. In SQL, queries can be run to produce tables, which can then be passed into a platform like Periscope Data for charting. Since the data is being prepared in one environment and visualized in another, some of the formatting options are lost in the process. The structure of those data tables limits what visuals can be made with the information.

With R and Python, there are libraries that make data manipulation extremely simple, which results in significantly improved visuals. Simply restructuring data tables can have an enormous effect on the overall readability of the information. Making a wide table into a long table allows the series to be mapped more easily. Doing the reverse makes the data easier to read as a table.

These languages also allow queries to be run with less stress on a system. For example, in Python, converting object datatypes to category type reduces the memory used, enabling a faster analyses and faster plotting. However your team analyzes and presents information, the flexibility to shift the data around is crucial to effectively translating insights into operations.

Restructuring Data in Periscope Data

Periscope Data makes it easy for teams to reformat data tables in advanced languages. Just run a SQL query to process the dataset, then pass the table into R or Python and use the reshape2/tidyR or pandas libraries to execute the transformation with a simple command. Using just SQL, manipulations like this would take multiple lines of complex transposing code. In some cases, queries that would take 50-100 lines to perform and run in SQL can be managed in R or Python with just a single line.

Once the data has been transformed, R and Python offer more advanced charting libraries that can create complex, customized visuals for data teams. Visuals that have been created to fit the specifications and preferences of key stakeholders, data teams can pass them directly into Periscope Data to be included in shared dashboards.

To learn more about how you can use R and Python inside Periscope Data to restructure your data for analysis or presentation, download our guide.

Want to discuss this article? Join the Periscope Data Community!

Neha Kumar
Neha is passionate about creating powerful data visualizations, spreading awareness of analytics best practices and learning about the latest and greatest advancements in data science. She strives to leverage multiple programming languages for analyses that are more efficient and effective than current methodologies.