Introduction to data wrangling with Raku - Anton Antonov rakuforprediction.wordpre…
👍 11 · 👀 u/liztormato · 📅 Jan 12 2022
[D] Tired of writing mundane data wrangling code.

I find the field of ML and DS fascinating; there is so much to learn, read, and experiment with.

However, recently in my day-to-day work I am under pressure to deliver product, and I find myself writing lots of hacky little notebooks trying various pipelines, libraries, algorithms, etc.

95% of the code is bad and repetitive: concatenating numpy arrays, splitting, and so on.

I feel that I am doing something wrong; we cannot all be spending this much time wrangling DFs.

I'm reaching out to fellow practitioners who may have been at this stage and managed to somehow break free of the shackles of data wrangling and turn their minds to advanced techniques and papers.
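
Not from the original post, but one pattern that addresses exactly this complaint: push the repetitive concat/split boilerplate into a small helper and a scikit-learn Pipeline, so each new notebook only swaps the estimator. A minimal sketch with made-up arrays; every name below is illustrative, not taken from the post:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

def assemble(feature_blocks, labels, test_size=0.2):
    """The concat/split boilerplate, written once instead of once per notebook."""
    X = np.concatenate(feature_blocks, axis=1)
    return train_test_split(X, labels, test_size=test_size, random_state=0)

# Preprocessing stays fixed; each experiment only swaps the final estimator.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = assemble(
    [np.random.rand(100, 3), np.random.rand(100, 2)],
    np.random.randint(0, 2, 100))
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))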

👍 176 · 👀 u/ydennisy · 📅 Oct 13 2021
BUSA90520 Data Wrangling & Visualisation

Hello prospective students,

To get ahead of the inevitable queries I will get about my subject, and about how hard it is for people with no prior programming experience etc., I figured I would just post some info about it here.

Many students who take my subject are concerned that they have no prior programming experience, and most do fine and enjoy the subject, even if it’s challenging.

The Python part is only 3-4 weeks, so we do not go into it really deeply. This is good in that it reduces the risk that, if you find programming particularly challenging, you can't still do reasonably well in the subject; but it also makes it more challenging in that there is more you need to learn in a short period of time.

60% of the marks come from completing 3 group projects, only one of which involves Python, so there are 4 brains available to figure things out and support each other in completing what you need to do. I make A LOT of consultation time available to coach students through these projects – you will be well supported so long as you don’t leave things to the last minute.

To be honest, I’ve not seen a big correlation between marks in my subject and prior programming experience. In some cases it can even be a problem, as students come with too much of a technical and not enough of a business mindset, or with bad habits they need to unlearn. What does seem to matter is a student’s ability to think logically and procedurally; that is to say, the ability to break complexity down into a series of small, solvable pieces, step by step.

The following article is an excellent elaboration on this way of problem solving.

https://www.freecodecamp.org/news/how-to-think-like-a-programmer-lessons-in-problem-solving-d1d8bf1de7d2/

For a longer introduction to how computers work, and therefore how you "talk" to them, you can watch Harvard CS50 Lecture 0:

https://youtu.be/YoXxevp1WRQ

If you want to know more about what’s involved in the area of study, and whether it’s applicable to your personal and career goals, I can recommend the following videos:

Data Analyst vs Business Analyst (my subject sits somewhere in the middle in terms of the skills for these two roles)

https://youtu.be/G4syHs3M82E

Data Analyst vs Data Scientist (We don’t reach the level of technical skills required for Data Science in my subject)

https://youtu.be/fUpChfNN5Uo

... keep reading on reddit ➡

👍 40 · 👀 u/unimelb_lyle · 📅 Nov 23 2021
Suggestions for Data Wrangling Tools/Books - Dataset w >3000 variables is Making Me Cry

I'm in the midst of a low-scale mental breakdown; I've got a combination of genetic, lab, and a few clinical datasets with a few thousand patients and over a thousand variables each. I've cleaned and harmonized the data, but I'm unsure how to proceed from there. The first step I want to take is just to group related variables, recode some ordinal/nominal variables, etc., before I start mining features. I'd also like to perform some feature engineering (for example, a CV risk variable using some of the metrics in the set).

For identifying potentially important variables, I know I can just repeatedly subsample a small number of features/patients from the aggregated data, apply a model like RF, and eventually rank important features. I'm a little wary of using ML for feature mining on the off chance that it glosses over some key metric.

Is there a tool I should use for getting my data in order? I can use the tidyverse but I'm just not a fan; there's gotta be something a little more streamlined (right?).
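
The repeated-subsampling idea described above is straightforward to script. The poster works in R, but purely as a sketch of the logic, here it is with scikit-learn; it assumes X is a DataFrame and y a Series on the same index, and every name and parameter is hypothetical:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def rank_features_by_subsampling(X, y, n_rounds=100, n_features=50, n_samples=500, seed=0):
    """Repeatedly fit a random forest on random feature/patient subsets and
    accumulate importances, so variables that matter across many subsets rise to the top."""
    rng = np.random.default_rng(seed)
    scores = pd.Series(0.0, index=X.columns)
    for _ in range(n_rounds):
        feats = rng.choice(X.columns, size=min(n_features, X.shape[1]), replace=False)
        rows = rng.choice(X.index, size=min(n_samples, X.shape[0]), replace=False)
        rf = RandomForestClassifier(n_estimators=200, random_state=0)
        rf.fit(X.loc[rows, feats], y.loc[rows])
        scores[feats] += rf.feature_importances_
    return scores.sort_values(ascending=False)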

👍 4 · 👀 u/ThiccThrowawayyy · 📅 Nov 04 2021
[MIT] The Missing Semester of Your CS Education - Proficiency with tools YouTube series covering cli, shell, git, profiling, debugging, vim, data wrangling, security & more

Classes teach you all about advanced CS topics, but they rarely teach you proficiency with programming tools. The video series will help you master the command-line, use a powerful text editor, use fancy features of version control systems, and much more! Class homepage

All video recordings of the lectures are available on YouTube.

👍 3k · 👀 u/SwapApp · 📅 May 14 2021
Arquero for Data Wrangling and Plot for Data Visualization in JavaScript

Anybody else see what is happening over in Observable? I do mostly data viz for work and was managing my data outside of the JavaScript environment because of how annoying I find it to wrangle data in JavaScript. I would import csv files and use Vega-Lite or D3 to create my graphs.

Now that I do a lot of my work with Arquero and Plot, I find myself almost exclusively using these JavaScript libraries for my data viz workflow. I am wondering if anyone else has noticed the JavaScript environment getting way more data-friendly for people who used to use R pretty exclusively? I thought I would share my excitement about this stuff for people who have been wanting to get into more data viz but haven't found the right tools.

👍 2 · 👀 u/prosocialbehavior · 📅 Nov 22 2021
Arquero for Data Wrangling and Plot for Data Visualization in JavaScript /r/datascience/comments/q…
👍 3 · 👀 u/prosocialbehavior · 📅 Nov 22 2021
Data wrangling in base R

I’m in the process of shifting from Excel/VBA to R. I’ve started with plots and graphs with great success, but I’ve still done all the data cleaning, etc., in Excel. I’ve been glad that I learned how to do things in base R before learning ggplot2.

The next step is learning how to do data wrangling and data cleaning in R. Would it still be useful to start out by learning how to do it in base R? It seems like everyone ends up using dplyr/tidyr etc. anyway. My work involves a lot of columns with dates and times, FYI.

👍 17 · 👀 u/Machine-Objective · 📅 Aug 22 2021
"Shock and Awe" - the Mandalorian Guide to Data Wrangling.

The swirling vortex of hyperspace churned and twisted around a freshly borrowed Lambda-class shuttle, one that sported the livery of Mimban’s new leadership but broadcast a signature of far more nefarious intent. Though it looked like a military craft, the egg-heads behind the operation had taken the time to craft a signature familiar to the local pirates, along with a message of valiant hijackings and local resistance bringing forth data and goods to be sold on the black market.

The message was in fact a blatant lie, and one that the occupant was quite proud of. A lone Mandalorian sat in the cockpit, accompanied by her personal astromech friend and more than a few guns, explosives, and other toys happily donated for the cause. They weren’t much to write home about, as most of the Mando’s kit would be more than enough, but it was the thought that counted, especially when that thought was ‘we’re going to pay you a hell of a lot of money and give you all this neat stuff to do things for us.’

Music to Jak’s ears.

The mission tailored to her money-hungry ways was relatively simple, a simple data collection job, but for Jak the job was more simple than that. She was a salvager by trade, finding and scrapping wrecks and all other sorts of assorted machines for a tidy profit, and it didn’t come without a handful of perils, most notably the boobytraps set by pirates, the same pirates she was making a bee-line for. So, while data collection would be all well and good, she intended to do far more than just that.

The shuttle hurtled through hyperspace, and the consoles began to read that the journey would soon be coming to an end. Jak patted the top of the astromech, giving him a sign that she was just about ready, then closed her eyes. Under her breath, the pilot spoke a sort of mantra in Mandalorian: a dedication to her brothers in arms, one to her father, one to the clan, and finally one to glory. The Mando stood and straightened the kama around her waist, then patted her holsters, both heavy blasters ready to go.

Somewhere in deep space an old, long-abandoned, and by then pirate-infested, station floated along, brought to life by a network of mismatched parts and jury-rigged plates, and covered in graffiti boasting of power, piracy, and several mentions of passers-by’s mothers. Unfortunately for all the good folk in the universe, it was also operational, as consoles began to beep and squeal, its sensors picking up the Lambda swiftly ap

... keep reading on reddit ➡

👍 2 · 👀 u/a_friendly_hobo · 📅 Sep 21 2021
Need some advice with data wrangling

I have a dataframe that looks as follows:

1 27339253 27464726 5 10435428 10451691 FF0000
1 27466674 27468028 5 10642238 10643442 FF0000
1 27471793 27543735 5 10646114 10740306 FF0000
1 27581443 27583256 11 5025604 5027566 FF0000
1 27584941 27586390 11 5022784 5023679 FF0000
1 27587123 27590372 11 5016504 5015934 FF0000
2 18550 19218 3 1651707 1652354 BB2A24
2 24167 30328 3 1653362 1655581 BB2A24
2 52587 53948 3 1667994 1669355 BB2A24
2 56307 67413 3 11437921 11453706 BB2A24
2 78561 79161 15 11524109 11524713 BB2A24
2 81341 85649 15 11521263 11523391 BB2A24
2 101700 102198 15 11530925 11531435 BB2A24

I would like to remove all but one of the rows that match in columns 1 and 4, turning columns 2 and 5 into the minimums of their respective ranges and columns 3 and 6 into the maximums of their respective ranges.

Desired output:

1 27339253 27543735 5 10435428 10740306 FF0000
1 27581443 27590372 11 5016504 5027566 FF0000
2 18550 67413 3 1651707 1669355 BB2A24
2 78561 102198 15 11521263 11531435 BB2A24

My real file is more than 5000 rows long, so I'd rather not have to do it manually. Any and all suggestions on how to achieve this with either base R or tidyverse functions would be greatly appreciated!
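
The poster asks for base R or tidyverse; purely to make the grouping rule concrete, here is the same collapse as a pandas sketch. The file name and column names are hypothetical, since the data has no header:

import pandas as pd

cols = ["chrom_a", "start_a", "end_a", "chrom_b", "start_b", "end_b", "color"]
df = pd.read_csv("regions.txt", sep=r"\s+", header=None, names=cols)

# One row per (column 1, column 4) pair: minimum of the start columns, maximum of the ends.
collapsed = (
    df.groupby(["chrom_a", "chrom_b"], sort=False, as_index=False)
      .agg(start_a=("start_a", "min"), end_a=("end_a", "max"),
           start_b=("start_b", "min"), end_b=("end_b", "max"),
           color=("color", "first"))
      [cols]  # restore the original column order
)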

👍 2 · 📅 Oct 01 2021
Python Data Wrangling for Beginners

Here is a tutorial on how to wrangle data. In this video, I step through the process of extracting data for some basic analysis.

https://youtu.be/mQC2Ac-9Tjw

👍 141 · 👀 u/Chris_AIE · 📅 Jun 17 2021
Data Wrangling using ShogunStudio 2

We’re on set shooting multiple cams and multiple segments. What’s the most efficient process to Data Wrangle?

👍 2 · 👀 u/boltwill · 📅 Sep 10 2021
Scicloj study sessions this weekend: data wrangling with Tablecloth

Hi.

We are planning some Scicloj study sessions this weekend about data wrangling with Tablecloth.

In the coming weekends, we'll have more sessions about the emerging Clojure data science stack.

This is part of the so-called #ml-study group. More about this, and about the other study groups, can be found in this thread.

Here is the current plan of sessions:

The sessions will be 2 hours long and will assume only basic knowledge of Clojure.

Please write to me if you wish to join.

👍 13 · 👀 u/daslu · 📅 Jul 29 2021
Azure Synapse - is the Wrangling Data Flow available?

I see articles about "Wrangling Data Flow" - effectively Power Query - in ADF but not in Synapse. I have had a look in Synapse and can't find it. Is it available yet?

👍 3 · 👀 u/FickleLife · 📅 Aug 31 2021
Data Wrangling - Multiline Python Statements? Trying to learn best syntax for Python coming from R

Hi,

I have worked with R as my primary language, dealing with everything from econometrics/ML applications, GPS analysis using APIs, and NLP contract analysis to unstructured data. I would say that if I have a problem and can use R, then I can likely devise some way to solve it.

However, I want to move into a new role, and I am finding, despite my 5+ years with R as my primary language, that Python is preferred when I talk with recruiters, so I have picked it up and I was hoping to get some guidance with good examples of code. In R it is very easy and code 'flows' between statements using the pipe operator, but in Python I am finding it less intuitive, from trial and error, to do multistep aggregations, data summarization, etc.

Anyone have a good guide on how to use Python for complex data summaries? I want to be able to do something like ifelse(col1 %in% c("phrase1", "phrase2"), Col2, NA) and get a count of how many unique values pass that logic. And I might have half a dozen conditional things like this, as I have to evaluate many similar statistics for different time periods quite often. Kaggle has some data cleaning material, but that data is perfect compared to some of my sources as a consultant.

Obviously this below chunk of code should be broken up across multiple lines, correct?

df_BorState = df[df.BorrowerCity=="San Francisco"].groupby( ["BorrowerState", "BorrowerCity", "BorrowerZip"] ).agg( Tot_Amount = ('CurrentApprovalAmount', 'sum') ).sort_values("Tot_Amount", ascending = False)

Is something like this standardized?

df_BorState = df[df.BorrowerCity=="San Francisco"].groupby(
    ["BorrowerState", "BorrowerCity", "BorrowerZip"]).agg(
        Tot_Amount = ('CurrentApprovalAmount', 'sum')).sort_values(
            "Tot_Amount", ascending = False)

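There is no official standard, but a common convention is to wrap the whole chain in parentheses so each method starts on its own line (it then reads much like a dplyr pipe). Below is a hedged sketch that reuses the question's df and column names; the phrase1/phrase2 filter mirrors the ifelse(%in%) example and is otherwise hypothetical:

import pandas as pd

# df is assumed to be the question's DataFrame.
# Chained aggregation wrapped in parentheses: one method per line, no backslashes.
df_BorState = (
    df[df.BorrowerCity == "San Francisco"]
    .groupby(["BorrowerState", "BorrowerCity", "BorrowerZip"])
    .agg(Tot_Amount=("CurrentApprovalAmount", "sum"))
    .sort_values("Tot_Amount", ascending=False)
    .reset_index()
)

# Rough equivalent of ifelse(col1 %in% c("phrase1", "phrase2"), Col2, NA):
# keep Col2 where col1 matches, otherwise NaN, then count the distinct values.
matched = df["Col2"].where(df["col1"].isin(["phrase1", "phrase2"]))
n_unique = matched.nunique()  # NaN rows are ignored by nunique()
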
👍 10 · 📅 Jun 16 2021
Data Visualization & Data Wrangling Masterclass with Python idownloadcoupon.com/coupo…
👍 3 · 👀 u/smartybrome · 📅 Jul 16 2021
Thoughts on Visualisation and Data Wrangling (BUSA20001) vs Entrepreneurship and Product Innovation (MKTG20007)

Hi! I'm trying to decide between 2 subjects: Visualisation and Data Wrangling (BUSA20001) vs Entrepreneurship and Product Innovation (MKTG20007), but I've not heard much about either of them. I was wondering if anyone could provide insight into these subjects':

-Difficulty

-Quality of course

-Pros & Cons

-Opinions

I'd greatly appreciate some thoughts & insights on those subjects and what I should choose. Thanks in advance!

FYI I'm currently a 2nd year doing B-COM, majoring in Marketing & Management. The options here I'm looking at are just electives to complete the 50 points/sem.

👍 3 · 👀 u/Acestar138 · 📅 Jul 09 2021
Data wrangling set-up

I work in Arizona and find an actual DIT is a luxury for most shoots. When I get hired by a producer for what they think is DIT work, I make it clear I'm a data wrangler and describe what that is. For the most part, that's what they are looking for. As of now I bring my MacBook with card readers and some adapters I've slowly started to collect. I'm trying to upgrade my data wrangling setup, as that's where most of my work is, before slowly moving my way into a DIT setup. I was hoping for some suggestions.

👍 40 · 👀 u/sammy_jammy · 📅 Mar 28 2021
Data Visualization & Data Wrangling Masterclass with Python freewebcart.com/udemy/dat…
👍 16 · 👀 u/abjinternational · 📅 Jul 18 2021
Data Wrangling

How do you wrangle data from moving images or long stills projects?

Upcoming 2-4 day production photo & film mix for me. Gotta expect at least 2 TB of data a day.
How do I back up and archive the files for the agency/production and the creative crew?

Most definitely wanna avoid any orange drives ;) But planning 2-3 copies a day on individual 2 TB Samsung T5/T7s seems like some hassle too. It's not a huge film production with 10+ days, so a 6-bay RAID 10 bunker feels wrong as well.

What would you recommend (cost-effective & fast enough)?
Maybe a 2-bay RAID 0 HDD setup or dual SSD enclosures? Or Samsung/SanDisk SSDs split for each day of production?

👍 2 · 👀 u/Active_Heron1424 · 📅 Jun 09 2021
Data Visualization & Data Wrangling Masterclass with Python freewebcart.com/udemy/dat…
👍 6 · 👀 u/abjinternational · 📅 Jul 16 2021
wrangling new hospital price transparency data

A new law went into effect stating that hospitals must publish their service prices to "empower patients and increase competition among hospitals". You can read the article here.

I'd like to collect this data and make some sense of it. There are 2 big challenges here: 1) getting the data and 2) making sense of it. I'm figuring out a way to do #1 - I have a list of 3300 hospitals (the majority including a website) and am figuring out how to find the .csv/.xlsx documents (they are somewhere on each website). This is something I'm pretty confident I can do. However, I have no clue how to wrangle the data - especially since every hospital is going to have it formatted its own way. Here are 3 examples (Google Docs).

Example 1

Example 2

Example 3

I'm looking for some help here. I'm hoping you guys can give me some direction, or better yet, if someone would like to work with me on this, that would be 100x better. The goal is to create some meaningful insights/reports for consumers and hospitals.
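
For step #1, a minimal sketch of the sort of crawler that could work, assuming a CSV listing hospital names and homepage URLs (the file name, column names, and the homepage-only search are assumptions; in practice the files often sit behind a "price transparency" page you would also need to follow):

import pandas as pd
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

hospitals = pd.read_csv("hospitals.csv")  # hypothetical columns: "name", "website"

def find_machine_readable_files(base_url):
    """Return absolute links to .csv/.xlsx files linked from a hospital page."""
    try:
        resp = requests.get(base_url, timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        return []
    soup = BeautifulSoup(resp.text, "html.parser")
    hrefs = [a.get("href", "") for a in soup.find_all("a")]
    return [urljoin(base_url, h) for h in hrefs
            if h.lower().endswith((".csv", ".xlsx"))]

results = {row["name"]: find_machine_readable_files(row["website"])
           for _, row in hospitals.iterrows()}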

👍 57 · 👀 u/imallinboozie · 📅 Jan 03 2021
What are some of the key things you learned as you shifted into a senior/leader role from a technical BI role? I.e. From data wrangling and presenting to more of a manager/head role in BI?

👍 49 · 👀 u/TheDataGentleman · 📅 Jan 17 2021
Creating pivot tables reliant on several .csv files (issues with data wrangling)

Hi everyone,

I haven't used R in a while (and wasn't particularly well versed in it initially) and I'm currently struggling to create a pivot table that relies on several .csv files. More specifically, I have been given a workbook of fish population data that contains 4 tabs: Fish (which contains Effort.Number amongst several measurements and data notes), Bathymetry, Stratum, and Effort. Essentially, I need to create two pivot tables showing sample size (N), mean total length, mean round weight, and mean assessed age for walleye: one by net set (effort number), and one by depth stratum. For the depth strata-based analysis, I'm asked to use 0-3m, 3-6m, 6-12m, and 12-20m as the depth strata. I have never created a pivot table in R before (let alone Excel), and I'm a little lost right now.

My thought process is that if I need to create a pivot table reliant on stratum information, I would need to convert the Effort.Number in the fish data to the stratum associated with it. I've done the following after using Excel to save the Effort tab as EffortData.csv and the Fish tab as FishData.csv.

library(dplyr)

EffortData <- read.csv("C:/Users/saram/OneDrive/Desktop/Fish/EffortData.csv")

EffortDataStratum <- EffortData %>%
  mutate(Stratum = case_when(
    End.Depth..m. <= 2.9  & Start.Depth..m. >= 0  ~ '1',
    End.Depth..m. <= 5.9  & Start.Depth..m. >= 3  ~ '2',
    End.Depth..m. <= 11.9 & Start.Depth..m. >= 6  ~ '3',
    End.Depth..m. <= 20   & Start.Depth..m. >= 12 ~ '4',
    TRUE ~ NA_character_))  # NA_character_ rather than the string 'NA'

View(EffortDataStratum)

This gives me 64 results, and I (dumbly) tried to just merge this column onto my Fish data frame, but because the stratum data has only 64 rows and the Fish data has 600+ entries (several per effort number), that didn't work.

I'm really struggling to wrap my head around how I could create two pivot tables as specified, without the correct length of entries. Any help would be very appreciated.
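
The poster is working in R, but purely to illustrate the join-then-aggregate logic (attach the per-effort stratum to every fish row, then pivot), here is a pandas sketch. The file names and Effort.Number follow the question; the fish measurement column names and the start-depth-only binning are assumptions:

import pandas as pd

fish = pd.read_csv("FishData.csv")
effort = pd.read_csv("EffortData.csv")

# Bin each effort (net set) into a depth stratum; the question bins on both
# start and end depth, but start depth alone keeps the sketch short.
effort["Stratum"] = pd.cut(effort["Start.Depth..m."],
                           bins=[0, 3, 6, 12, 20],
                           labels=["0-3m", "3-6m", "6-12m", "12-20m"],
                           include_lowest=True)

# Many fish rows per effort: a left join repeats the stratum for every fish.
fish = fish.merge(effort[["Effort.Number", "Stratum"]],
                  on="Effort.Number", how="left")

# The "pivot table": sample size and means by stratum (swap "Stratum" for
# "Effort.Number" to get the by-net-set version).
summary = fish.groupby("Stratum").agg(
    N=("Total.Length", "size"),
    Mean_Length=("Total.Length", "mean"),
    Mean_Weight=("Round.Weight", "mean"),
    Mean_Age=("Assessed.Age", "mean"),
)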

👍 2 · 👀 u/statshelpplz1 · 📅 Jun 13 2021
On finance.yahoo.com, they have a date/time format that's not amenable to vlookup, unless I do 20 iterations of data wrangling...

In one section for option prices on finance.yahoo.com, you can see that the date and time are combined in a user-hostile way, because I'm not easily able to parse this into a date and a time in two different columns.

The options data maybe found here.

I'd like to do a vlookup between the date and time of each transaction and the price of the stock at that date/time. I'm able to do this, but I have to do 20 iterations of data wrangling. Specifically, I have to "find and replace" "0A" with "0 A", "1A" with "1 A", and so on, until I get to the afternoon transactions for "0P", "1P", "2P", "3P", etc., which I substitute with "0 P", "1 P", and so on.

I've done vlookups where I looked at the time of the transaction, and compared it to the price of the underlying security at that same time before.
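
If a scripting step is an option, the 20 find-and-replace passes collapse into a single regular expression: insert a space between the trailing digit and the AM/PM marker, then parse. A minimal pandas sketch, assuming the strings look like "2021-04-23 3:55PM EDT" (the exact Yahoo format may differ):

import pandas as pd

raw = pd.Series(["2021-04-23 3:55PM EDT", "2021-04-23 10:02AM EDT"])

# One regex instead of 20 find-and-replace passes: turn "5PM" into "5 PM".
spaced = raw.str.replace(r"(\d)(AM|PM)", r"\1 \2", regex=True)

# Drop the trailing timezone token and parse; date and time then split cleanly.
parsed = pd.to_datetime(spaced.str.replace(r"\s+\w+$", "", regex=True),
                        format="%Y-%m-%d %I:%M %p")
dates, times = parsed.dt.date, parsed.dt.time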

👍 4 · 📅 Apr 23 2021
How to be a master and fluent in dplyr? Or specifically getting more fluent in data wrangling?

I’ve gone through Hadley’s R for Data Science, and I’m comfortable with the tidyverse, for the most part.

But I still feel that sometimes I can do more, or rather that I should be learning more about this package, because most of my work is really data wrangling, more than modeling itself. To use models, I have to prepare the datasets into a specific format, with specific grouping and so on, and this demands a fast and efficient way of data wrangling.

As of now, I solve problems as and when I face them. But something tells me that this is not always efficient. It lacks the bigger-picture narrative that R for Data Science has, but I feel that book is quite basic at times, or does not focus much on data wrangling per se.

I just discovered tidytable, which helps my sanity by giving me the speed of data.table with the fluency of dplyr.

Maybe this is the problem: it’s hard to talk about data wrangling without context. It’s too generic. But at the same time, when I look at the number of different functions in the dplyr documentation, it feels like there’s so much I could learn if I could walk through them with some narratives that build upon each other.

👍 8 · 👀 u/levenshteinn · 📅 Apr 28 2021
Data Visualization & Data Wrangling Masterclass with Python idownloadcoupon.com/coupo…
👍 5 · 👀 u/smartybrome · 📅 Jul 16 2021
[todayilearned] TIL there is a course from MIT called the “Missing Semester” that goes through the Shell, Vim, Git, data wrangling, etc. – all the stuff that CS students are just ‘supposed to know’ but aren’t usually formally taught in a classroom missing.csail.mit.edu/
👍 35 · 👀 u/Know_Your_Shit_v2 · 📅 Jun 25 2021
