How to do data pre-processing in production

I am traditionally a software engineer but recently became a data scientist, and I'm looking for ideas on how to do data pre-processing in production. The use case is as follows:

  1. A new data row becomes available every few minutes

  2. It needs to be pre-processed before it can be ingested by downstream models.

Creating an OOP-style data pre-processor class does not sit well with me, because it doesn't make sense for there to be a "class" structure here. All of the methods end up being static (or non-static but never touching class state, so practically static), so what is the point of the class? At the same time, Python is an OOP language, which makes it awkward to structure things in a class-less manner.
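For illustration, here's a rough sketch of the class-less structure I have in mind: a module of plain functions plus a small pipeline helper (all names are hypothetical).

# preprocessing.py - plain functions instead of a stateful class (names are made up)
from functools import reduce
from typing import Callable

import pandas as pd


def fill_missing(row: pd.DataFrame) -> pd.DataFrame:
    # Example step: replace missing values with zeros.
    return row.fillna(0)


def scale_features(row: pd.DataFrame) -> pd.DataFrame:
    # Example step: scale a numeric column; the constant would really come from config/training.
    return row.assign(value=row["value"] / 100.0)


def make_pipeline(*steps: Callable[[pd.DataFrame], pd.DataFrame]) -> Callable[[pd.DataFrame], pd.DataFrame]:
    # Compose the steps left to right into a single callable.
    return lambda row: reduce(lambda acc, step: step(acc), steps, row)


preprocess = make_pipeline(fill_missing, scale_features)

# Called whenever a new row arrives:
# clean_row = preprocess(new_row)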

Wondering if anyone has faced similar scenarios and what some good practices are here?

Edit: Thank you all!! Awesome responses from everyone that are definitely making me think better about this task

👍︎ 9
💬︎
👤︎ u/gougie2
📅︎ Dec 31 2021
🚨︎ report
Complex Data Pre-Processing

I'm currently creating a template for an enterprise-grade wireless system; however, some of the information presented over SNMP isn't in a particularly useful format. The bit that I'm stuck on is creating a discovery rule/item prototype for the number of clients connected to each specific AP or SSID.

Items I want to create/prototype:

  • {#APNAME} number of associated clients
  • {#SSID} number of associated clients

Relevant information I can get over SNMP:

Client-based OID

  • Client MAC address
  • MAC address of AP client is associated with
  • SSID client is associated with
  • Client status (Active/Time Out)

AP-based OID

  • AP MAC address
  • AP Name

As far as I can see, I would need to do the following steps:

  1. Return a list of clients and the AP each is associated with
  2. Look for unique AP MAC addresses, and count the number of instances of each

At this point, I have the number of associations per-AP, based on the AP's MAC address. This process could be repeated for the SSID item with no further processing. For the AP item I would then need to:

  1. Look up AP name, based on the AP MAC address
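To make the counting and lookup concrete, here is a small Python sketch of that logic (the data is made up, and I realise Zabbix itself wouldn't run Python; it's only to illustrate what I'm after):

from collections import Counter

# Hypothetical data pulled from the SNMP tables described above.
clients = [
    {"mac": "aa:01", "ap_mac": "ap:01", "ssid": "corp", "status": "Active"},
    {"mac": "aa:02", "ap_mac": "ap:01", "ssid": "guest", "status": "Active"},
    {"mac": "aa:03", "ap_mac": "ap:02", "ssid": "corp", "status": "Time Out"},
]
aps = {"ap:01": "Office-AP-1", "ap:02": "Office-AP-2"}

# Only count clients that are currently active.
active = [c for c in clients if c["status"] == "Active"]

# Step 2: count client instances per AP MAC address and per SSID.
per_ap_mac = Counter(c["ap_mac"] for c in active)
per_ssid = Counter(c["ssid"] for c in active)

# Extra step for the AP item: translate AP MAC -> AP name.
per_ap_name = {aps[mac]: count for mac, count in per_ap_mac.items()}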

If anyone is able to shed any light on how to do this, or whether this is just not within the capabilities of Zabbix, I would be hugely appreciative! Thanks in advance.

👍︎ 3
💬︎
👤︎ u/Mcw00t
📅︎ Dec 07 2021
🚨︎ report
Pre-processing audio data with different durations for Deep Learning

In real-world audio datasets, not all files have the same duration / num. of samples. This can be a problem for the majority of Deep Learning models (e.g., CNN), which expect training data with a fixed shape.

In computer vision, there’s a simple workaround when there are images with different sizes: resizing. What about audio data? The solution is more complex.

First, you should decide the number of samples you want to consider for your experiments (e.g., 22050 samples).

Then, when loading waveforms, you should ensure that they have exactly that many samples. To do this, you can do two things:

  1. cut the waveforms which have more samples than expected;
  2. zero-pad the waveforms which have fewer samples than expected.

Does this feel too abstract?

No worries, in my new video I demonstrate how you can use cutting/padding with audio data in PyTorch and torchaudio.
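If you just want the gist before watching, here's a minimal sketch of the cut/pad logic in plain PyTorch (num_samples is the target length you picked above; this is an illustration, not the exact code from the video):

import torch


def fix_length(waveform: torch.Tensor, num_samples: int) -> torch.Tensor:
    # waveform has shape (channels, samples)
    if waveform.shape[1] > num_samples:
        # 1. cut waveforms that are too long
        waveform = waveform[:, :num_samples]
    elif waveform.shape[1] < num_samples:
        # 2. zero-pad waveforms that are too short (pad on the right)
        missing = num_samples - waveform.shape[1]
        waveform = torch.nn.functional.pad(waveform, (0, missing))
    return waveform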

This video is part of the “PyTorch for Audio and Music Processing” series, which aims to teach you how to use PyTorch and torchaudio for audio-based Deep Learning projects.

Video:

https://www.youtube.com/watch?v=WyJvrzVNkOc&list=PL-wATfeyAMNoirN4idjev6aRu8ISZYVWm&index=6

👍︎ 5
💬︎
👤︎ u/diabulusInMusica
📅︎ Jun 17 2021
🚨︎ report
Linfa's release 0.4.0 - with Barnes-Hut t-SNE, PLS family, pre-processing for text data, incremental KMeans, Platt Scaling and many more improvements rust-ml.github.io/linfa/n…
👍︎ 53
💬︎
👤︎ u/bytesnake
📅︎ Apr 28 2021
🚨︎ report
Interview with Fujitsu Cloud Technologies: 80% of Data Science is Pre-processing lionbridge.ai/articles/in…
👍︎ 201
💬︎
👤︎ u/Shirappu
📅︎ Jan 31 2020
🚨︎ report
Totally stuck on how to pre-process, visualise and cluster data

So I have a project to complete using PySpark and I'm at a total loss. I need to retrieve data from 2 APIs (which I've done; see the code below). I now need to pre-process and store the data, visualise the number of cases and deaths per day, and then perform a k-means clustering analysis on one of the data sets to identify which weeks cluster together. This is pretty urgent work given the nature of COVID, and I just don't understand how to use PySpark at all, so I would really appreciate any help you can give me. Thanks.

Code for API data request:

# Import all UK data from UK Gov API
from requests import get


def get_data(url):
    response = get(url, timeout=10)  # use the url argument rather than the global endpoint

    if response.status_code >= 400:
        raise RuntimeError(f'Request failed: {response.text}')

    return response.json()


if __name__ == '__main__':
    endpoint = (
        'https://api.coronavirus.data.gov.uk/v1/data?'
        'filters=areaType=nation;areaName=England&'
        'structure={"date":"date","newCases":"newCasesByPublishDate","newDeaths":"newDeaths28DaysByPublishDate"}'
    )

    data = get_data(endpoint)
    print(data)

# Get all UK data from covid19 API and create dataframe
import json
import requests
from pyspark.sql import *
spark = SparkSession.builder.getOrCreate()  # needed when not running in a shell/notebook that predefines `spark`
url = "https://api.covid19api.com/country/united-kingdom/status/confirmed"
response = requests.request("GET", url)
data = response.text.encode("UTF-8")
data = json.loads(data)
rdd = spark.sparkContext.parallelize([data])
df = spark.read.json(rdd)
df.printSchema()

df.show()

df.select('Date', 'Cases').show()

# Look at total cases
import pyspark.sql.functions as F
df.agg(F.sum("Cases")).collect()[0][0]

I feel like that last bit of code for total cases is correct, but it returns a result of 2.5 billion cases, so I'm at a total loss.
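For reference, this is roughly the weekly aggregation I think I need before clustering. It's only a sketch: it assumes the "Cases" column is a running (cumulative) total per Province, which is my best guess at why the plain sum comes out so large.

from pyspark.sql import functions as F, Window

# Assumption: "Cases" is cumulative per Province, so take day-over-day differences first.
w = Window.partitionBy("Province").orderBy("Date")
daily = (
    df.withColumn("DailyCases", F.col("Cases") - F.lag("Cases", 1, 0).over(w))
      .withColumn("Week", F.weekofyear(F.to_date(F.col("Date"))))
)

# Weekly totals, one row per week, ready for clustering/visualisation.
weekly = daily.groupBy("Week").agg(F.sum("DailyCases").alias("WeeklyCases"))
weekly.orderBy("Week").show()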

👍︎ 4
💬︎
👤︎ u/Modest_Gaslight
📅︎ Jan 11 2022
🚨︎ report
I created a command line interface tool that automates pre-processing data, creating and using machine learning models. Would anyone like to extend/improve this tool?

Hi y'all,

So as the title implies, that's what the tool is about in a nutshell. You can check it out using the link below. I released a stable version this week!

I have many plans to extend/improve the tool, and I have already opened some issues/features that I find useful or nice to have, in case someone wants to try working on something. However, it would be awesome to get feedback from the community.

I wanted to share it here, and hopefully some of you will find the project useful and would like to help improve the tool. Feel free to give your opinion, open issues, or propose new features.

Github link: https://github.com/nidhaloff/igel

👍︎ 10
💬︎
👤︎ u/nidhaloff
📅︎ Oct 03 2020
🚨︎ report
Data pre processing transformation

So far I have seen several types of data pre-processing transformations:

  1. log transform
  2. min/max scale
  3. box cox and yeo johnson
  4. normalizing
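
For concreteness, here is roughly how these are applied with scikit-learn (X is a placeholder matrix of positive-valued features):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, StandardScaler

X = np.random.rand(100, 3) + 0.1                # placeholder: positive-valued features

X_log = np.log1p(X)                             # 1. log transform
X_minmax = MinMaxScaler().fit_transform(X)      # 2. min/max scaling to [0, 1]
X_boxcox = PowerTransformer(method="box-cox").fit_transform(X)          # 3. Box-Cox (strictly positive data)
X_yeojohnson = PowerTransformer(method="yeo-johnson").fit_transform(X)  # 3. Yeo-Johnson (any real values)
X_norm = StandardScaler().fit_transform(X)      # 4. normalizing (zero mean, unit variance)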

I am working on a classification problem with two classes.

A) Are some of these transformations "inherently" better than others?

B) I am very interested in using the random forest algorithm (as well as xgboost). Do these kinds of transformations (especially for random forest) really have the potential to improve classification results? Intuitively, I would have thought these transformations are more useful for classical statistics models such as linear regression.

Thanks!

👍︎ 2
💬︎
👤︎ u/jj4646
📅︎ Oct 16 2020
🚨︎ report
Several activities of the Open Science Festival (Kickstarting a series of free online skills and knowledge sessions in 2021 to help researchers get started with Open Science; #OpenScience #Barcamp; importance of data pre-processing) accelerateopenscience.nl/…
👍︎ 20
💬︎
👤︎ u/GrassrootsReview
📅︎ Oct 29 2020
🚨︎ report
80% of Data Science is Pre-processing says Fujitsu Cloud Technologies lionbridge.ai/articles/in…
👍︎ 77
💬︎
👤︎ u/LimarcAmbalina
📅︎ Jan 31 2020
🚨︎ report
Bioinformatics Project - Data Science for Drug Discovery Part 1 | Data Collection and Pre-Processing youtube.com/watch?v=plVLR…
👍︎ 24
💬︎
👤︎ u/dataprofessor
📅︎ May 12 2020
🚨︎ report
How the US census led to the first data processing company 125 years ago arstechnica.com/science/2…
👍︎ 3k
💬︎
👤︎ u/ourlifeintoronto
📅︎ Jan 02 2022
🚨︎ report
Tensorflow 2.0: Pre-processing input data generated from ImageDataGenerator before putting to model.fit_generator

I'm using ImageDataGenerator from tf.keras (TensorFlow 2.0) to generate data for my training. The raw data contains images and the corresponding segmentation labels (all in RGB format). The sample code with my intuitive approach is below:

....

    # Create two instances with the same arguments
    data_gen_args = dict(
        rotation_range=30,
        width_shift_range=0.2,
        height_shift_range=0.2,
        shear_range=0.2,
        zoom_range=0.2,
        validation_split=0.2
    )
  
    image_datagen = ImageDataGenerator(**data_gen_args)
    label_datagen = ImageDataGenerator(**data_gen_args)

    # Provide the same seed and keyword arguments to the fit and flow methods
    seed = 2020

    image_train_generator = image_datagen.flow_from_directory(
        IMAGES_DIR,
        target_size=(IMAGE_SIZE, IMAGE_SIZE),
        class_mode=None,
        batch_size=BATCH_SIZE,
        seed=seed,
        subset="training")

    label_train_generator = label_datagen.flow_from_directory(
        LABELS_DIR,
        target_size=(IMAGE_SIZE, IMAGE_SIZE),
        class_mode=None,
        batch_size=BATCH_SIZE,
        seed=seed,
        subset="training")

    # https://github.com/keras-team/keras/issues/13123#issuecomment-529498919
    train_generator = (pair for pair in zip(image_train_generator, label_train_generator))

    image_val_generator = image_datagen.flow_from_directory(
        IMAGES_DIR,
        target_size=(IMAGE_SIZE, IMAGE_SIZE),
        class_mode=None,
        batch_size=BATCH_SIZE,
        seed=seed,
        subset="validation")

    label_val_generator = label_datagen.flow_from_directory(
        LABELS_DIR,
        target_size=(IMAGE_SIZE, IMAGE_SIZE),
        class_mode=None,
        batch_size=BATCH_SIZE,
        seed=seed,
        subset="validation")

    validation_generator = (pair for pair in zip(image_val_generator, label_val_generator))

    model.fit_generator(
        train_generator,
        steps_per_epoch=STEPS_PER_EPOCH,
        epochs=EPOCHS,
        callbacks=callbacks,
        validation_data=validation_generator,
        validation_steps=VADIDATION_STEP)

However, I need to pre-process the images and labels first before passing them to fit_generator(): each image has to

... keep reading on reddit →
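
One option worth noting here, assuming a simple per-image transform is enough, is ImageDataGenerator's preprocessing_function argument, which is applied to every generated image after resizing/augmentation; a minimal sketch (the normalisation step is just an example):

from tensorflow.keras.preprocessing.image import ImageDataGenerator


def preprocess_image(img):
    # Called on each image (a float32 NumPy array) after resize/augmentation.
    return img / 255.0


data_gen_args = dict(
    rotation_range=30,
    width_shift_range=0.2,
    height_shift_range=0.2,
    validation_split=0.2,
    preprocessing_function=preprocess_image,  # hook for custom per-image pre-processing
)
image_datagen = ImageDataGenerator(**data_gen_args)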

👍︎ 2
💬︎
👤︎ u/nguyendhn
📅︎ Jun 24 2020
🚨︎ report
Pre-processing models for time series data

If we should select data from the same distribution for the test and train sets, how do we approach splitting our data into dev/test and train sets for time series data such as stock prices or brain waves?
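
For what it's worth, the pattern I've seen most often is a purely chronological split rather than a random one, e.g. scikit-learn's TimeSeriesSplit, where each fold trains on the past and validates on the following block (a sketch, assuming X and y are already ordered by time):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)   # placeholder time-ordered features
y = np.arange(100)                  # placeholder targets

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    # Each split trains on earlier data and evaluates on the next block in time.
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]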

Thanks.

👍︎ 2
💬︎
👤︎ u/bci-hacker
📅︎ Jun 30 2020
🚨︎ report
Thanks for post-processing | the game is called Pre Dusk, and I'm happy with this! v.redd.it/c5s4toum45d81
👍︎ 569
💬︎
👤︎ u/OkbaAmrate
📅︎ Jan 22 2022
🚨︎ report
Audio pros, would you prefer completely clean recordings or recordings with pre processing on it?

I've recently gotten into the UAD ecosystem and I've been debating whether or not to take advantage of the fact that you can apply pre-processing through the interface. As an amateur, I don't really feel confident enough to apply any EQ or compression to the recording and would prefer to just do it in the DAW. I know it's very common to apply pre-processing on things like vocals, for example, to save time later down the line.

I’d love to hear some opinions on which you prefer. Thanks!

👍︎ 62
💬︎
👤︎ u/greg_from_utah
📅︎ Dec 29 2021
🚨︎ report
Data Pre-Processing in Python: How I learned to love parallelized applies with Dask and Numba medium.com/@ernestk.socia…
👍︎ 76
💬︎
👤︎ u/gthank
📅︎ Mar 08 2018
🚨︎ report
Can someone help me understand why data batch processing and data streaming processing pose such different challenges in data management?

I am a data scientist and I see the differences mentioned everywhere when it comes to data management. I am having trouble understanding why this is the case, and I'd appreciate it if anyone cares to share some insights. Thank you!

👍︎ 72
💬︎
👤︎ u/digital-bolkonsky
📅︎ Jan 29 2022
🚨︎ report
Data Pre-processing in R: Handling Missing Data youtube.com/watch?v=toRjE…
👍︎ 5
💬︎
👤︎ u/dataprofessor
📅︎ Jan 04 2020
🚨︎ report
Pre-processing audio data on GPU

In my new tutorial, you can learn how to pre-process audio data directly on the GPU using PyTorch and torchaudio.
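
The core idea, in a few lines (the file path is a placeholder, and this is just a sketch rather than the full tutorial code):

import torch
import torchaudio

device = "cuda" if torch.cuda.is_available() else "cpu"

signal, sample_rate = torchaudio.load("example.wav")   # placeholder path
signal = signal.to(device)                             # move the waveform to the GPU

# Transforms are nn.Modules, so they can run on the GPU as well.
mel_spectrogram = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate).to(device)
mel = mel_spectrogram(signal)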

This video is part of the “PyTorch for Audio and Music Processing” series, which aims to teach you how to use PyTorch and torchaudio for audio-based Deep Learning projects.

Video:

https://www.youtube.com/watch?v=3wD_eocmeXA&list=PL-wATfeyAMNoirN4idjev6aRu8ISZYVWm&index=7

👍︎ 4
💬︎
👤︎ u/diabulusInMusica
📅︎ Jun 21 2021
🚨︎ report
