There's a lot of hype about Data Mesh nowadays. Some people have gone so far as to say that the data warehouse should be replaced by a data mesh.
I don't get it. Data mesh is just a way of doing exactly the same thing using data products, no? Or am I wrong?
I don't see how a data mesh can replace a data warehouse. Can anyone please advise?
Hi everyone,
I'd be interested to see what those more experienced than me would do in my situation.
I'm currently one of three analysts in a pretty old-school business (logistics). They've only just created an analytics department over the past ~3 years, so there are no data engineers, and I doubt management would believe that's even a real job title. I've taught myself SQL and Python and am pretty confident with both (more so Python than SQL).
I spend most of my time creating reports in PowerBI. This is where I've started noticing issues, namely:
I've come to the conclusion that, considering these issues, a data warehouse would probably ...
A client department of ours is trying to get our IT department to take on a data warehouse initiative. My colleagues and I have no idea how to build one, and the only data warehouses I've seen from parent agencies and my former employer have been abject failures. I feel that /r/programming doesn't mention data warehouses because they're absolutely nonsense: cultists propose that they exist but then give no fundamental direction on how to build one. Even the wiki article on data warehouses is confusing, and it tends to lean on one citation in particular, Ralph Kimball. We suspect that the client department has been claiming that we lack a data warehouse to mask their own performance.
Am I right to be skeptical of data warehouses or is there an actual theory and definitive value add to them that I'm missing?
Hi! I am a Data Engineer & Analyst at a non-profit organization. I'm pretty much the only person with an engineering/coding background, and since I don't really have any guidance in my organization, I was wondering what I could improve in the implementation I've built for our data warehouse and pipelines.
The setup is a BigQuery DW with some Cloud Functions running Python pipelines to feed data into it. We currently only have a few pipelines set up, pulling data from a PostgreSQL DB: we pull all the data we want from the Postgres table (at most a couple hundred thousand rows containing only text and numbers), truncate the previous version, and insert all the data again.
This was done because the Postgres tables had no reliable timestamps to support loading only deltas. Please advise if there is a cleaner way to do this!
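For reference, here is a minimal sketch of the truncate-and-reload pattern described above, not the pipeline as it actually exists; the connection string, source table, and BigQuery table ID are hypothetical:

```python
# A minimal sketch of the truncate-and-reload pattern described above.
# The connection string, source table, and BigQuery table ID are hypothetical.
import pandas as pd
import sqlalchemy
from google.cloud import bigquery

def full_refresh(pg_url: str, source_table: str, bq_table_id: str) -> None:
    # Pull the entire source table; fine at a few hundred thousand rows.
    engine = sqlalchemy.create_engine(pg_url)
    df = pd.read_sql(f"SELECT * FROM {source_table}", engine)

    # WRITE_TRUNCATE replaces the previous contents of the target table,
    # i.e. the same "truncate and insert everything again" approach.
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE")
    client.load_table_from_dataframe(df, bq_table_id, job_config=job_config).result()

full_refresh(
    "postgresql+psycopg2://user:password@host:5432/dbname",  # hypothetical
    "public.donations",                                      # hypothetical
    "my-project.analytics.donations",                        # hypothetical
)
```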
We also have some more tables in there where data is inserted manually every quarter, since we get a big file in a very dirty format; I came up with a script to clean it, but I still upload it manually.
We are also considering building another one to get more data from a partner's Redshift instance.
PLEASE NOTE that no modeling was performed at all for this data warehouse; I just figured out which tables and columns would be useful for our analytics and pulled all of that data into BigQuery to feed our Data Studio reports.
There are also no tests in the pipelines, and even though I have a git repository for the pipelines, if I want to change the code for a Cloud Function I have to go in and manually change it for each one.
I am just looking for some general guidance on ways to improve this implementation since I feel a bit lost on what the best next steps would be.
Thank you!
Hello everyone, I've been tasked with a new endeavor at my job: managing the data warehouse that enables a suite of products to work properly. I am getting up to speed as quickly as I can, but figured it could be beneficial to ask this community for advice from anyone who has been in a similar situation.
- Anything you wish you'd known before joining a data warehouse team?
- Any advice for handling expectations/deliverables with the teams whose products consume the data?
- Is there any way to help people removed from the data warehouse understand that getting the data pipeline ingestion automated and rock solid will enable us to react more on the fly to their fire drills?
- Any ceremony practices or work processes that were effective but different from those of more traditional products whose users are external clients?
Thanks in advance. This is all pretty far outside the realm of the work I've done before.
Hi Friends
I would like to hear if anyone has implemented the new data warehouse modelling approach aka the Unified Star Schema. It looks like the model can be built without any business requirements, and as I understand it, it claims to solve most modelling challenges.
Any practical comments?
Reference: The Unified Star Schema: An Agile and Resilient Approach to Data Warehouse and Analytics Design
Hi all, I am currently handling thousands of SKUs under one organization, and all of them have their unit conversions set up.
The client recently changed supplier and all the unit conversions are now invalid...
Question is, is there a way to keep the current SKUs but replace the unit conversions without having to deactivate all SKUs and re-upload?
All help is appreciated!
According to Wikipedia: "Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data."
Understood that collection and organization are covered by surveying, possibly experimental design, etc., but wouldn't it be prudent to be fluent in SQL/databases as part of a statistics curriculum? Especially since these systems are fairly popular and common?
I will be creating a data warehouse from the ground up. I'm kind of lost on where to begin, since I thought I'd only be doing maintenance on the current infrastructure.
Setup: I am working for a medium-sized (?) manufacturing company with 500 customers currently. Our sales department manually generates daily/weekly/monthly/yearly reports, and they get this info from different data sources. They want to do analytics and slice-and-dice type reporting, and we have already bought Tableau for the reporting.
Q:
Is MySQL + Cloud SQL good for this DWH setup? If not, what's your recommendation? What are the best practices? How do I begin integrating an ETL process to fetch all the historical data into a staging layer? What are the things we should install for this?
Appreciate all your assistance!
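To make the staging-layer question above a bit more concrete, here is a minimal sketch of one possible initial historical load into a MySQL (Cloud SQL) staging table; the connection string, file name, and table name are all made up for illustration:

```python
# A minimal sketch of loading one historical extract into a MySQL staging table.
# The connection string, file, and table names are hypothetical.
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine(
    "mysql+pymysql://etl_user:password@10.0.0.5:3306/staging"  # hypothetical Cloud SQL instance
)

# Land the raw extract as-is in a staging table; transformations happen downstream.
df = pd.read_csv("sales_history_2023.csv")        # hypothetical source extract
df["_loaded_at"] = pd.Timestamp.now(tz="UTC")     # simple audit column
df.to_sql("stg_sales_history", engine, if_exists="append", index=False)
```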
I've been working on a personal project for a while which scrapes Reddit comments from a subreddit on a daily basis. I collect specific data and output it in CSV format to a raw landing S3 bucket.
Current architecture:
1. On a daily basis, AWS EventBridge (previously known as CloudWatch Events) is scheduled via cron to run at a specific time.
2. The cron rule triggers an AWS Lambda function (which runs the web-scraping code packaged as a Docker image stored in AWS ECR). This scrapes the subreddit and outputs to the S3 landing directory.
3. AWS Glue is scheduled to run daily to load the Athena table.
However, I want to replace step three and use boto3 to load a Snowflake table instead, as it is a more "real" DWH (unlike Athena).
I know that at this point, with the size of the data, it is definitely overkill, but it makes future scaling way easier. Has anybody done a personal project using Snowflake as the DWH? I would query Snowflake infrequently (fewer than 30 times a month). Is it smart to consider future scalability, given that I know I am going to aggregate a lot of data at some point? Input appreciated.
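For what it's worth, here is a minimal sketch of how that Snowflake load could look with snowflake-connector-python and a COPY INTO from an external stage; the account, credentials, stage, and table names are hypothetical, and the stage over the landing bucket would need to be created first:

```python
# A minimal sketch: load the daily CSV landed in S3 into a Snowflake table.
# Account, credentials, stage, and table names are all hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345.eu-west-1",  # hypothetical
    user="LOADER",
    password="...",
    warehouse="LOAD_WH",
    database="REDDIT",
    schema="RAW",
)
cur = conn.cursor()
try:
    # Assumes a stage was created over the landing bucket, e.g.
    # CREATE STAGE raw_landing URL='s3://my-landing-bucket/' CREDENTIALS=(...);
    cur.execute("""
        COPY INTO comments
        FROM @raw_landing/comments/
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)
finally:
    cur.close()
    conn.close()
```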
Hey everyone, just looking for a little guidance on some issues that are coming up with how my org works with data. I'll preface all of this by saying I'm a full-stack engineer with minimal data engineering experience, so I apologize if this is a noob question. We have a variety of data coming in through different sources (Airflow jobs, Airbyte, Fivetran) and it ends up in our (Snowflake) warehouse. I'm finding that I more often want to pull data out of the warehouse to reintegrate into existing API/service databases. Is this a smell? Should I use the warehouse directly instead? I've come across the concept of reverse ETL, but that seems more like tools that sync data into third-party services.
Hey all, as the title explains, I need to convert a few dashboards from Oracle SQL to GCP. I did one today, and I did the following:
Rewrote the query using BigQuery syntax
Created a new PBIX file and connected to my GCP project, imported the data
I copied the whole connection plus query string from the query editor
Opened the original PBIX dashboard, opened the Oracle query and clicked "edit query". In there, I pasted the GCP string and the data refreshed just fine.
Posted the dashboard to PBI service and all seems to be working fine.
I need to do this to other dashboards that have up to 7 Oracle queries as different tables. Can I do the same as in the steps above, or is there an easier/better way to just change the data source? I'm trying to avoid redoing the dashboards from scratch or recreating relationships.
Thanks.
I am part of a project that consists of multiple organizations wanting to share data with each other on an ongoing basis. It would consist of the same data elements from each company, and it would be refreshed at a set interval. Ideally, we would like to join all of the collected data together while providing access to the data for each of the companies involved. What would be the best approach for accomplishing this that would also have a high level of security?
Hi all - I recently learned that I could load a CSV or flat file to Netezza using an external table, which is really helpful when the data warehouse isn't specified to your needs. Even though I was super thrilled with this discovery, I would now like to load a Power Query table without having to export to a flat file. Wondering if anyone has done anything like this! I was thinking that I would need to reference the table in Power Query through some sort of address string.
Edit 1: Apparently I had missed this and related posts indicating the number of Swedish Apes was some 40 thousand, and the German sub is 13 thousand. OK, that's a facepalm for sure. Adjust all estimated numbers upwards by a minimum of 4 times to maintain the proportionality. I will keep the numbers as originally written, however. The main point of the 1st and 2nd sections was to establish a minimum viable estimate, because I think some of the numbers coming out of surveys just don't work. Even if the numbers are increased 10 or 20 times, the survey numbers I am critiquing still don't work in global proportions.
Edit 2: So on top of accidentally massively undercounting our dear Europoors because I hadn't seen the Swedish data, it appears we have a complete mess in using Bloomberg data, which I didn't know about. Apparently Bloomberg reports solely on the basis of institutional data and doesn't show individual retail. Yet the institutions of banks, brokers, hedge funds, pensions, retirements, etc. are composed of enormous numbers of retail investors, directly or indirectly. How this would be reported, and whether it is even possible for retail to be separated out of institutions, I don't know or understand. Another really important comment I have seen is that international investors (if they're reported at all) could be getting reported as Americans if the end point of their trade is an American institution such as Bank of New York Mellon, as many international brokers operate partnerships for access to American markets through American institutions.
This could be a major factor confounding the count of Europeans and others.
The post's first and second sections will remain unaltered below, as I don't know what or how to correct. The numbers are a mess to try and work with; informational asymmetry is real. Just remember that even with fucked-up undercountings of minimum estimates, we still own the float by a lot.
I consult for a number of companies all owned, at least in part, by the same person. The data (income statement, balance sheet, etc.) is stored in Excel files and various accounting software packages for each. Recently, the discussion came up that it would be great to bring all this information together in a single place, even if just the key line items are available at first.
At first, I thought a SQLite database with a few dimensions (like this) would do the trick, but I can't help but wonder whether I'm: 1. using the right tool for the job, and 2. thinking about this project the right way.
This would be my first experience with schema design/creation. Has anyone here had experience with a project like this?
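As a rough illustration of the kind of schema being considered, here is a minimal SQLite sketch with one fact table and a couple of dimensions; all table and column names are made up:

```python
# A minimal sketch of a small star schema in SQLite for consolidated financials.
# Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect("consolidated_financials.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS dim_company (
        company_id   INTEGER PRIMARY KEY,
        company_name TEXT NOT NULL
    );
    CREATE TABLE IF NOT EXISTS dim_account (
        account_id   INTEGER PRIMARY KEY,
        account_name TEXT NOT NULL,   -- e.g. 'Revenue', 'COGS'
        statement    TEXT NOT NULL    -- 'income_statement' or 'balance_sheet'
    );
    CREATE TABLE IF NOT EXISTS fact_financials (
        company_id INTEGER REFERENCES dim_company(company_id),
        account_id INTEGER REFERENCES dim_account(account_id),
        period     TEXT NOT NULL,     -- e.g. '2024-03'
        amount     REAL NOT NULL
    );
""")
conn.commit()
conn.close()
```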
Two years ago I switched from a data analyst to a broad BI developer role in our company. In that time I got very familiar with SQL/SSMS, SSIS, SSAS, SSRS and I know how to work with and expand the existing data warehouse at our company with new data sources and additional facts/dimensions. Very practical.
However, coming from an analyst/statistics background I feel I'm still missing a good understanding of data warehouse modeling concepts. Things like:
Normalization/denormalization
Kimball
Inmon
Data vault
Advantages/disadvantages of different data model schemas
Any tips for (types of) online courses I should consider to properly learn the concepts and theory of data warehousing and data modeling? What terms should I look for?
I feel I'm going to need to understand this stuff more thoroughly. Not just to deal with the impostor syndrome, but I also get the sense our only senior BI colleague is looking to go elsewhere...
Any suggestions would be appreciated!
I'm designing a data warehouse to hold survey data. It will hold various types of surveys each with different questions and answers. I'm trying to implement a conventional star schema but have some doubts regarding the fact table.
The two options are to go for what Kimball calls a measure type dimension (essentially the normalized version, with one column containing the questions and another the answers), or to fully denormalize the table and have a column for each question populated with the respective answers. If I go for the latter approach I might eventually end up with up to 10 fact tables (I was thinking one per survey type would be easier to manage in this case), each having 50-100 columns, vs a single fact table for the first.
The way I see it, the first approach has the main advantage that it's easier to maintain and has a fixed schema even if new questions are added, while the second would be better for validating the data, connecting to OLAP and visualisation tools in general, and would make it a lot easier to aggregate since each measure would have its own column. Lastly, the first will be tall and dense while the second approach would be very wide and sparse (1 measure or value per row at the limit).
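To make the two layouts concrete, here is a tiny pandas sketch (with made-up question names) that pivots the tall question/answer layout into the wide one-column-per-question layout:

```python
# A tiny sketch contrasting the two fact layouts; data and names are made up.
import pandas as pd

# Tall layout (measure type dimension): one row per answer.
tall = pd.DataFrame({
    "response_id": [1, 1, 2, 2],
    "question":    ["age", "satisfaction", "age", "satisfaction"],
    "answer":      [34, 4, 29, 5],
})

# Wide layout: one row per response, one column per question.
wide = tall.pivot(index="response_id", columns="question", values="answer")
print(wide)
```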
If you have any experience or have been burned by this in the past please share your thoughts. Which one would you go for and why?
I'd like to generate some discussion around the topic of these two architectures.
Reverse question: will modern data warehouses be able to take over data science?
The way I see it, products originally developed on top of data lake architectures, such as Databricks, Hudi, or Iceberg, and products originally developed as data warehouses, such as Snowflake and BigQuery, are all trying to converge on what we call the "Lakehouse" paradigm: combining data warehousing and data science workloads and ecosystems onto one platform.
With all of the new features introduced through Delta Lake and Spark 3.0 around transactionality, incrementality, query optimization, pruning etc., do you think it's ready to take over OLAP analytics and modeling? Why/why not?
I'm reading Kimball's data warehouse toolkit and I'm having a hard time understanding the need for a periodic snapshot table. It seems to me that we can just create it from the transactional table by running aggregations grouped by the interval of interest (daily, weekly, monthly, etc.).
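That intuition matches how a snapshot is usually built; here is a minimal pandas sketch (with hypothetical column names) that derives a monthly snapshot from a transactional fact, and a periodic snapshot table essentially persists something like this pre-computed result at the chosen grain:

```python
# A minimal sketch: derive a monthly snapshot from a transactional fact.
# Column names are hypothetical.
import pandas as pd

transactions = pd.DataFrame({
    "account_id": [1, 1, 1, 2],
    "txn_date":   pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03", "2024-01-10"]),
    "amount":     [100.0, -40.0, 25.0, 500.0],
})

# Aggregate transactions to a monthly grain...
monthly = (
    transactions
    .assign(month=transactions["txn_date"].dt.to_period("M"))
    .groupby(["account_id", "month"], as_index=False)["amount"].sum()
)
# ...then accumulate to get an end-of-month balance, a typical snapshot measure.
monthly["eom_balance"] = monthly.groupby("account_id")["amount"].cumsum()
print(monthly)
```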
Hi, in summary here's the situation: a large enterprise with terabytes of data on Teradata (heh). But as this is batch-run, it cannot be used for real-time information flowing from many source databases to web applications. There's ongoing work to CDC this by building an intermediate layer, converting it and making it match all of the different source warehouses (some of which predate Star Wars), which ultimately ends up in a DB2 on z/OS (mainframe).

An actual approach to data quality is pretty much nonexistent in this organization, so the general idea now (in the very short term) is to perform simple data profiling calculations on this DB2 (which is now the new 'single source of truth') and spit the results out to some reporting screen. Down the road, management is going to look at the whole circus (Apache Griffin etc.). Now I was told you cannot run these calculations every 5 minutes or so because it would use too many resources (average, median, length(field), min/max, quantile, group by and then all of the above, timediff of 2 columns, unique, frequency, count NULL, and perhaps some more). There are about 7-8 tables, 4-5 columns each, and let's say about a billion records.

On my personal computer I've done these simple calculations in Python (on a single processor) on large datasets (>2 GB) in a few seconds at most, albeit just on a pandas dataframe (or two). So how would this clog up a data warehouse that is built for these kinds of tasks? How would I know if performing these calculations would be too much (without actually performing them; I don't have access)? I just want a general idea of whether these commands cause most systems to clog up at this number of records, or whether it really doesn't matter.
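For reference, this is roughly the kind of per-column profile being described, sketched in pandas with a made-up sample; in practice the same metrics would presumably be pushed down to DB2 as SQL rather than pulled into Python:

```python
# A rough sketch of the per-column profiling metrics mentioned above.
# Sample data and names are made up.
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for col in df.columns:
        s = df[col]
        numeric = pd.api.types.is_numeric_dtype(s)
        rows.append({
            "column": col,
            "count": len(s),
            "nulls": int(s.isna().sum()),
            "unique": int(s.nunique()),
            "min": s.min() if numeric else None,
            "max": s.max() if numeric else None,
            "mean": s.mean() if numeric else None,
            "median": s.median() if numeric else None,
        })
    return pd.DataFrame(rows)

sample = pd.DataFrame({"customer_id": [1, 2, None], "amount": [10.5, 20.0, 7.25]})
print(profile(sample))
```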
Me: A recently employed data scientist who was put on a data engineering task and has never seen his colleagues in real life.
I have worked with BI in the past, but mostly on the analytics side: querying data from a data warehouse, generating analyses and such. So I know the very basics of databases and I basically know how PostgreSQL works.
Now I'm at a company that has zero infrastructure for a data warehouse. People run things on their own computers and store them in Excel and CSV files in the cloud.
I want to build a data warehouse from scratch, beginning with getting a VM to run procedures and host the database, but I know nothing about this part. In fact, I know so little about the technical side that I don't even know what to search for, and I can't even elaborate further on which resources I need you to point me to. Someone asked me if I wanted to run a cycle server or not, and I don't even know if the answer is yes or no.
Can anyone point me to good resources to learn the technical side of building a DW?
How should it be made?