I am traditionally a software engineer but recently became a data scientist, and I'm looking for ideas on how to do data pre-processing in production. The use case is as follows:
A new data row becomes available every few minutes
It needs to be preprocessed before it can be ingested by downstream models.
Creating an OOP-style data pre-processor class does not sit well with me, because it does not make sense that there is a "class" structure here. All of the methods end up being static (or non-static, but they don't use the class state, so they're practically static)... so really, what is the point of the class? At the same time, Python is an OOP language, which makes it difficult to structure things in a class-less manner...
Wondering if anyone has faced similar scenarios and what some good practices are here?
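One pattern that comes up for this: keep the pre-processor as a plain module of pure functions and compose them into a pipeline, with no class at all. Below is a minimal sketch of that idea; the step names and the dict-shaped row are illustrative placeholders, not a prescribed design.
from functools import reduce
from typing import Callable, Dict, Iterable, Optional

Row = Dict[str, Optional[float]]
Step = Callable[[Row], Row]

def drop_nulls(row: Row) -> Row:
    # illustrative step: remove missing fields
    return {k: v for k, v in row.items() if v is not None}

def scale_features(row: Row) -> Row:
    # illustrative step: placeholder scaling
    return {k: v / 100.0 for k, v in row.items()}

PIPELINE = (drop_nulls, scale_features)

def preprocess(row: Row, steps: Iterable[Step] = PIPELINE) -> Row:
    # apply each step in order; no class state needed
    return reduce(lambda acc, step: step(acc), steps, row)

# called whenever a new row arrives (e.g. from a scheduler or message queue)
processed = preprocess({"a": 42.0, "b": None})
print(processed)
Each step is a pure function from row to row, so the module is easy to unit-test and the "state" the class would have held simply doesn't exist.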
Edit: Thank you all! Awesome responses from everyone; they're definitely making me think better about this task.
I'm currently creating a template for an enterprise-grade wireless system; however, some of the information presented over SNMP isn't in a particularly useful format. The bit I'm stuck on is creating a discovery rule/item prototype for the number of clients connected to each specific AP or SSID.
Items I want to create/prototype:
Relevant information I can get over SNMP:
Client-based OID
AP-based OID
As far as I can see, I would need to do the following steps:
At this point, I have the number of associations per-AP, based on the AP's MAC address. This process could be repeated for the SSID item with no further processing. For the AP item I would then need to:
If anyone is able to shed any light on how to do this, or whether this is just not within the capabilities of Zabbix, I would be hugely appreciative! Thanks in advance.
In real-world audio datasets, not all files have the same duration/number of samples. This can be a problem for the majority of Deep Learning models (e.g., CNNs), which expect training data with a fixed shape.
In computer vision, there's a simple workaround when images have different sizes: resizing. What about audio data? The solution is more complex.
First, you should decide on the number of samples you want to use for your experiments (e.g., 22050 samples).
Then, when loading waveforms, you should ensure they have that expected number of samples. To ensure this, you can do two things: cut the waveform if it is longer than expected, or pad it if it is shorter.
Does this feel too abstract?
No worries, in my new video I demonstrate how you can use cutting/padding with audio data in PyTorch and torchaudio.
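For the impatient, here is a minimal sketch of the cut/pad idea, assuming a target of 22050 samples and a waveform tensor of shape (channels, samples) as returned by torchaudio.load; the file path is a placeholder.
import torch
import torchaudio

NUM_SAMPLES = 22050  # the fixed length chosen for the experiment

def fix_length(waveform: torch.Tensor, num_samples: int = NUM_SAMPLES) -> torch.Tensor:
    # waveform has shape (channels, samples), as returned by torchaudio.load
    if waveform.shape[1] > num_samples:        # too long -> cut
        waveform = waveform[:, :num_samples]
    elif waveform.shape[1] < num_samples:      # too short -> right-pad with zeros
        pad_amount = num_samples - waveform.shape[1]
        waveform = torch.nn.functional.pad(waveform, (0, pad_amount))
    return waveform

signal, sr = torchaudio.load("example.wav")    # "example.wav" is a placeholder path
signal = fix_length(signal)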
This video is part of the "PyTorch for Audio and Music Processing" series, which aims to teach you how to use PyTorch and torchaudio for audio-based Deep Learning projects.
Video:
https://www.youtube.com/watch?v=WyJvrzVNkOc&list=PL-wATfeyAMNoirN4idjev6aRu8ISZYVWm&index=6
So I have a project to complete using PySpark and I'm at a total loss. I need to retrieve data from 2 APIs (which I've done; see the code below). I now need to pre-process and store the data, visualise the number of cases and deaths per day, and then perform a k-means clustering analysis on one of the data sets to identify which weeks cluster together. This is pretty urgent work given the nature of COVID, and I just don't understand how to use PySpark at all, so I would really appreciate any help you can give me. Thanks.
Code for API data request:
# Import all UK data from UK Gov API
from requests import get

def get_data(url):
    # use the url argument rather than relying on the global 'endpoint'
    response = get(url, timeout=10)
    if response.status_code >= 400:
        raise RuntimeError(f'Request failed: {response.text}')
    return response.json()

if __name__ == '__main__':
    endpoint = (
        'https://api.coronavirus.data.gov.uk/v1/data?'
        'filters=areaType=nation;areaName=England&'
        'structure={"date":"date","newCases":"newCasesByPublishDate","newDeaths":"newDeaths28DaysByPublishDate"}'
    )
    data = get_data(endpoint)
    print(data)
# Get all UK data from covid19 API and create dataframe
import json
import requests
from pyspark.sql import SparkSession

# 'spark' is only predefined in the pyspark shell / notebooks, so create it explicitly
spark = SparkSession.builder.getOrCreate()

url = "https://api.covid19api.com/country/united-kingdom/status/confirmed"
response = requests.request("GET", url)
data = json.loads(response.text)

rdd = spark.sparkContext.parallelize([data])
df = spark.read.json(rdd)
df.printSchema()
df.show()
df.select('Date', 'Cases').show()

# Look at total cases
# Note: 'Cases' from this endpoint appears to be a cumulative running total per day,
# so summing it across all days double-counts and gives a very large number
import pyspark.sql.functions as F
df.agg(F.sum("Cases")).collect()[0][0]
I feel like that last bit of code for total cases is written correctly, but it returns a result of 2.5 billion cases, so I'm at a total loss.
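In case a concrete starting point helps, below is a rough sketch (an assumption-laden sketch, not a definitive solution) of one way to turn the Gov API response above into a Spark DataFrame, aggregate it by week, and run pyspark.ml's KMeans. It assumes the JSON returned by get_data() has a "data" key holding the date/newCases/newDeaths records requested by the structure filter, and k=3 is just a placeholder.
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.getOrCreate()

gov = get_data(endpoint)                 # reuse the helper and endpoint defined above
df = spark.createDataFrame(gov["data"])  # assumed columns: date, newCases, newDeaths
df = df.withColumn("date", F.to_date("date"))

# Daily figures for plotting: one row per day with new cases and deaths
daily = df.orderBy("date").select("date", "newCases", "newDeaths")

# Weekly features for k-means: total cases and deaths per ISO week
weekly = (df.withColumn("week", F.weekofyear("date"))
            .groupBy("week")
            .agg(F.sum("newCases").alias("cases"),
                 F.sum("newDeaths").alias("deaths")))

assembler = VectorAssembler(inputCols=["cases", "deaths"], outputCol="features")
features = assembler.transform(weekly)

kmeans = KMeans(k=3, seed=1)             # k=3 is an arbitrary placeholder
model = kmeans.fit(features)
model.transform(features).orderBy("week").show()   # adds a 'prediction' column per week
For the per-day visualisation, daily.toPandas() gives a pandas DataFrame that can be plotted with whatever plotting library is available.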
Hi y'all,
So, as the title implies, that's what the tool is about in a nutshell. You can check it out using the link below. I released a stable version this week!
I have many plans to extend and improve the tool, and I've already opened some issues/feature requests that I find useful or nice to have, in case someone wants to try working on something. It would be awesome to get feedback from the community.
I wanted to share it here, and hopefully some of you will find the project useful and would like to help improve it. Feel free to give your opinion, open issues, or request new features.
Github link: https://github.com/nidhaloff/igel
So far I have seen different types of data pre-processing transformations:
I am working on a classification problem with two classes.
A) Are some of these transformations "inherently" better than others?
B) I am very interested in using the random forest algorithm (as well as xgboost). Do these kinds of transformations (especially for random forest) really have the potential to improve classification results? Intuitively, I would have thought these transformations are more useful for classical statistical models such as linear regression.
Thanks!
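For what it's worth, this kind of thing is cheap to check empirically. Here is a small, self-contained sketch on synthetic data (not a definitive answer; the dataset and parameters are placeholders): tree-based models split on per-feature thresholds, so monotonic transforms such as standard scaling usually change their results very little, and the comparison below makes that easy to see for a given dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic two-class problem, purely for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

raw = RandomForestClassifier(n_estimators=200, random_state=0)
scaled = make_pipeline(StandardScaler(),
                       RandomForestClassifier(n_estimators=200, random_state=0))

print("no scaling  :", cross_val_score(raw, X, y, cv=5).mean())
print("with scaling:", cross_val_score(scaled, X, y, cv=5).mean())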
I'm using ImageDataGenerator from tf.keras (TensorFlow 2.0) to generate data for my training. The raw data contains images and the corresponding segmentation labels (all in RGB format). Sample code with my intuitive approach is below:
....
# Create two instances with the same arguments
data_gen_args = dict(
    rotation_range=30,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    validation_split=0.2
)
image_datagen = ImageDataGenerator(**data_gen_args)
label_datagen = ImageDataGenerator(**data_gen_args)

# Provide the same seed and keyword arguments to the fit and flow methods
seed = 2020
image_train_generator = image_datagen.flow_from_directory(
    IMAGES_DIR,
    target_size=(IMAGE_SIZE, IMAGE_SIZE),
    class_mode=None,
    batch_size=BATCH_SIZE,
    seed=seed,
    subset="training")
label_train_generator = label_datagen.flow_from_directory(
    LABELS_DIR,
    target_size=(IMAGE_SIZE, IMAGE_SIZE),
    class_mode=None,
    batch_size=BATCH_SIZE,
    seed=seed,
    subset="training")

# https://github.com/keras-team/keras/issues/13123#issuecomment-529498919
train_generator = (pair for pair in zip(image_train_generator, label_train_generator))

image_val_generator = image_datagen.flow_from_directory(
    IMAGES_DIR,
    target_size=(IMAGE_SIZE, IMAGE_SIZE),
    class_mode=None,
    batch_size=BATCH_SIZE,
    seed=seed,
    subset="validation")
label_val_generator = label_datagen.flow_from_directory(
    LABELS_DIR,
    target_size=(IMAGE_SIZE, IMAGE_SIZE),
    class_mode=None,
    batch_size=BATCH_SIZE,
    seed=seed,
    subset="validation")
validation_generator = (pair for pair in zip(image_val_generator, label_val_generator))

model.fit_generator(
    train_generator,
    steps_per_epoch=STEPS_PER_EPOCH,
    epochs=EPOCHS,
    callbacks=callbacks,
    validation_data=validation_generator,
    validation_steps=VADIDATION_STEP)
However, I need to pre-process the images and labels before passing them to fit_generator(): each image has to
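Since the actual per-image requirement is cut off above, here is only a minimal sketch of one way to slot custom pre-processing in: wrap the zipped generators in a plain Python generator that transforms each batch before yielding it. preprocess_image and preprocess_label are hypothetical placeholders for whatever each image and label actually need.
def preprocess_image(batch):
    # placeholder: e.g. scale RGB values to [0, 1]
    return batch / 255.0

def preprocess_label(batch):
    # placeholder: e.g. map RGB label colours to integer class ids
    return batch

def preprocessed_pairs(image_gen, label_gen):
    # apply the pre-processing to every (images, labels) batch pair
    for images, labels in zip(image_gen, label_gen):
        yield preprocess_image(images), preprocess_label(labels)

train_generator = preprocessed_pairs(image_train_generator, label_train_generator)
validation_generator = preprocessed_pairs(image_val_generator, label_val_generator)
If the pre-processing is strictly per image, ImageDataGenerator's preprocessing_function argument (applied to each input after resizing and augmentation) may be a simpler fit.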
If we should select data from the same distribution for the test and train sets, how do we approach splitting our data into dev/test and train sets for time-series data such as stock prices or brain waves?
Thanks.
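A common (though not the only) approach is a purely chronological split: train on the earliest portion and hold out later, contiguous periods for dev and test, so evaluation data always comes from the model's "future". A minimal sketch, with the fractions as arbitrary placeholders:
import numpy as np

def chronological_split(series, train_frac=0.7, dev_frac=0.15):
    # keep temporal order: earliest samples for training, later ones for dev/test
    n = len(series)
    train_end = int(n * train_frac)
    dev_end = int(n * (train_frac + dev_frac))
    return series[:train_end], series[train_end:dev_end], series[dev_end:]

prices = np.arange(1000)                 # placeholder for a price / EEG series
train, dev, test = chronological_split(prices)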
I've recently gotten into the UAD ecosystem, and I've been debating whether or not to take advantage of the fact that you can apply pre-processing through the interface. As an amateur, I don't really feel confident enough to apply any EQ or compression to the recording and would prefer to just do it in the DAW. I know it's very common to apply pre-processing to things like vocals, for example, to save time later down the line.
I'd love to hear some opinions on which you prefer. Thanks!
I am a data scientist, and I see the differences mentioned everywhere when it comes to data management. I am having trouble understanding why this is the case, if anyone cares to share some insights. Thank you!
In my new tutorial, you can learn how to pre-process audio data directly on the GPU using PyTorch and torchaudio.
This video is part of the "PyTorch for Audio and Music Processing" series, which aims to teach you how to use PyTorch and torchaudio for audio-based Deep Learning projects.
Video:
https://www.youtube.com/watch?v=3wD_eocmeXA&list=PL-wATfeyAMNoirN4idjev6aRu8ISZYVWm&index=7
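Roughly, the idea (a minimal sketch of the general pattern, not necessarily exactly what the tutorial does) is that torchaudio transforms are nn.Modules, so they can be moved to the GPU and applied to tensors living there; the file path and spectrogram parameters below are placeholders.
import torch
import torchaudio

device = "cuda" if torch.cuda.is_available() else "cpu"

# Transforms are nn.Modules, so they can live on the GPU like a model layer
mel_spectrogram = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=512, n_mels=64
).to(device)

signal, sr = torchaudio.load("example.wav")   # "example.wav" is a placeholder path
signal = signal.to(device)                    # move the waveform to the same device
mel = mel_spectrogram(signal)                 # computed on the GPU when available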