There's a lot of hype about Data Mesh nowadays. Some people have gone so far as to say that the data warehouse should be replaced by a data mesh.
I don't get it. Data mesh is just a way of doing exactly the same thing using data products, no? Or am I wrong?
I don't see how a data mesh can replace a data warehouse. Can anyone please advise?
Hi everyone,
I'd be interested to see what those more experienced than me would do in my situation.
I'm currently one of three analysts in a pretty old-school business (logistics). They've only just created an analytics department over the past ~3 years, so there are no data engineers, and I doubt management would believe that's even a real job title. I've taught myself SQL and Python and am pretty confident with both (more so Python than SQL).
I spend most of my time creating reports in PowerBI. This is where I've started noticing issues, namely:
I've come to the conclusion that, considering these issues, a data warehouse would probably ...
A client department of ours is trying to get our IT department to take on a data warehouse initiative. My colleagues and I have no idea how to build one, and the only data warehouses I've seen from parent agencies and my former employer have been abject failures. I feel that /r/programming doesn't mention data warehouses because they're absolutely nonsense: cultists propose that they exist but then give no fundamental direction on how to build one. Even the wiki article on data warehouses is confusing, and it tends to lean on one citation in particular, Ralph Kimball. We suspect that the client department has been claiming that we lack a data warehouse to mask their own performance.
Am I right to be skeptical of data warehouses or is there an actual theory and definitive value add to them that I'm missing?
Hi! I am a Data Engineer & Analyst at a non-profit organization. I'm pretty much the only person with an engineering/coding background, and since I don't really have any guidance in my organization, I was wondering what I could improve in the implementation I've built for our data warehouse and pipelines.
The setup is a BigQuery DW with some Cloud Functions running Python pipelines to feed data into it. We currently only have a few pipelines set up, pulling data from a PostgreSQL DB: we pull all the data we want from the Postgres table (at most a couple hundred thousand rows containing only text and numbers), truncate the previous version, and insert all the data again.
This was done because the Postgres tables had no reliable timestamps to support loading only deltas. Please advise if there is a cleaner way to do this!
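For reference, here is a minimal sketch of the truncate-and-reload pattern described above, not the pipeline as it actually exists; the connection string, source table, and BigQuery table ID are hypothetical:

```python
# A minimal sketch of the truncate-and-reload pattern described above.
# The connection string, source table, and BigQuery table ID are hypothetical.
import pandas as pd
import sqlalchemy
from google.cloud import bigquery

def full_refresh(pg_url: str, source_table: str, bq_table_id: str) -> None:
    # Pull the entire source table; fine at a few hundred thousand rows.
    engine = sqlalchemy.create_engine(pg_url)
    df = pd.read_sql(f"SELECT * FROM {source_table}", engine)

    # WRITE_TRUNCATE replaces the previous contents of the target table,
    # i.e. the same "truncate and insert everything again" approach.
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE")
    client.load_table_from_dataframe(df, bq_table_id, job_config=job_config).result()

full_refresh(
    "postgresql+psycopg2://user:password@host:5432/dbname",  # hypothetical
    "public.donations",                                      # hypothetical
    "my-project.analytics.donations",                        # hypothetical
)
```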
We also have some more tables in there where data is inserted manually every quarter, since we get a big file in a very dirty format; I came up with a script to clean it, but I still upload it manually.
We are also considering building another one to get more data from a partner's Redshift instance.
PLEASE NOTE that no modeling was performed at all for this data warehouse; I just figured out which tables and columns would be useful for our analytics and pulled all of that data into BigQuery to feed our Data Studio reports.
There are also no tests in the pipelines, and even though I have a git repository for the pipelines, if I want to change the code for a Cloud Function I have to go in and manually change it for each one.
I am just looking for some general guidance on ways to improve this implementation since I feel a bit lost on what the best next steps would be.
Thank you!
Hello everyone, I've been tasked with a new endeavor at my job: managing the data warehouse that enables a suite of products to work properly. I am getting up to speed as quickly as I can, but figured it could be beneficial to ask this community for advice from anyone who has been in a similar situation.
- Anything you wish you'd known before joining a data warehouse team?
- Any advice for handling expectations/deliverables with the teams whose products consume the data?
- Is there any way to help people removed from the data warehouse understand that getting the data pipeline ingestion automated and rock solid will enable us to react more on the fly to their fire drills?
- Any ceremony practices or work processes that were effective but different from those of more traditional products whose users are external clients?
Thanks in advance. This is all pretty far outside the realm of the work I've done before.
Hi Friends
I would like to hear if anyone has implemented the new data warehouse modelling approach aka the Unified Star Schema. It looks like the model can be built without any business requirements, and as I understand it, it claims to solve most modelling challenges.
Any practical comments?
Reference: The Unified Star Schema: An Agile and Resilient Approach to Data Warehouse and Analytics Design
Hi all, I am currently handling thousands of SKUs under one organization, and all of them have their unit conversions set up.
The client recently changed supplier and all the unit conversions are now invalid...
Question is, is there a way to keep the current SKUs but replace the unit conversions without having to deactivate all SKUs and re-upload?
All help is appreciated!
According to Wikipedia: "Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data."
Understood that collection and organization are covered by surveying, possibly experimental design, etc., but wouldn't it be prudent to be fluent in SQL/databases as part of a statistics curriculum? Especially since these systems are fairly popular and common?
I will be creating a data warehouse from the ground up. I'm kind of lost on where to begin, since I thought I'd only be doing maintenance on the current infrastructure.
Setup: I am working for a medium-sized (?) manufacturing company with 500 customers currently. Our sales department manually generates daily/weekly/monthly/yearly reports, and they get this info from different data sources. They want to do analytics and slice-and-dice type reporting, and we have already bought Tableau for the reporting.
Q:
Is MySQL + Cloud SQL good for this DWH setup? If not, what's your recommendation? What are the best practices? How do I begin integrating an ETL process to fetch all the historical data into a staging layer? What are the things we should install for this?
Appreciate all your assistance!
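To make the staging-layer question above a bit more concrete, here is a minimal sketch of one possible initial historical load into a MySQL (Cloud SQL) staging table; the connection string, file name, and table name are all made up for illustration:

```python
# A minimal sketch of loading one historical extract into a MySQL staging table.
# The connection string, file, and table names are hypothetical.
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine(
    "mysql+pymysql://etl_user:password@10.0.0.5:3306/staging"  # hypothetical Cloud SQL instance
)

# Land the raw extract as-is in a staging table; transformations happen downstream.
df = pd.read_csv("sales_history_2023.csv")        # hypothetical source extract
df["_loaded_at"] = pd.Timestamp.now(tz="UTC")     # simple audit column
df.to_sql("stg_sales_history", engine, if_exists="append", index=False)
```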
I've been working on a personal project for a while which scrapes Reddit comments from a subreddit on a daily basis. I collect specific data and output it in CSV format to a raw landing S3 bucket.
Current architecture:
1. On a daily basis, AWS EventBridge (previously known as CloudWatch Events) is scheduled via cron to run at a specific time.
2. The cron rule triggers an AWS Lambda function (which runs the web-scraping code packaged as a Docker image stored in AWS ECR). This scrapes the subreddit and outputs to the S3 landing directory.
3. AWS Glue is scheduled to run daily to load the Athena table.
However, I want to replace step three and use boto3 to load a Snowflake table instead, as it is a more "real" DWH (unlike Athena).
I know that at this point, with the size of the data, it is definitely overkill, but it makes future scaling way easier. Has anybody done a personal project using Snowflake as the DWH? I would query Snowflake infrequently (fewer than 30 times a month). Is it smart to consider future scalability, given that I know I am going to aggregate a lot of data at some point? Input appreciated.
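For what it's worth, here is a minimal sketch of how that Snowflake load could look with snowflake-connector-python and a COPY INTO from an external stage; the account, credentials, stage, and table names are hypothetical, and the stage over the landing bucket would need to be created first:

```python
# A minimal sketch: load the daily CSV landed in S3 into a Snowflake table.
# Account, credentials, stage, and table names are all hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345.eu-west-1",  # hypothetical
    user="LOADER",
    password="...",
    warehouse="LOAD_WH",
    database="REDDIT",
    schema="RAW",
)
cur = conn.cursor()
try:
    # Assumes a stage was created over the landing bucket, e.g.
    # CREATE STAGE raw_landing URL='s3://my-landing-bucket/' CREDENTIALS=(...);
    cur.execute("""
        COPY INTO comments
        FROM @raw_landing/comments/
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)
finally:
    cur.close()
    conn.close()
```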
Hey everyone, just looking for a little guidance on some issues that are coming up with how my org works with data. I'll preface all of this by saying I'm a full-stack engineer with minimal data engineering experience, so I apologize if this is a noob question. We have a variety of data coming in through different sources (Airflow jobs, Airbyte, Fivetran) and it ends up in our (Snowflake) warehouse. I'm finding that I more often want to pull data out of the warehouse to reintegrate into existing API/service databases. Is this a smell? Should I use the warehouse directly instead? I've come across the concept of reverse ETL, but that seems more like tools that sync data into third-party services.
Hey all, as the title explains, I need to convert a few dashboards from Oracle SQL to GCP. I did one today, and I did the following:
Rewrote the query using BigQuery syntax
Created a new PBIX file and connected to my GCP project, imported the data
I copied the whole connection plus query string from the query editor
Opened the original PBIX dashboard, opened the Oracle query and clicked "edit query". In there, I pasted the GCP string and the data refreshed just fine.
Posted the dashboard to PBI service and all seems to be working fine.
I need to do this to other dashboards that have up to 7 Oracle queries as different tables. Can I do the same as in the steps above, or is there an easier/better way to just change the data source? I'm trying to avoid redoing the dashboards from scratch or recreating relationships.
Thanks.
I am part of a project that consists of multiple organizations wanting to share data with each other on an ongoing basis. It would consist of the same data elements from each company, and it would be refreshed at a set interval. Ideally, we would like to join all of the collected data together while providing access to the data for each of the companies involved. What would be the best approach for accomplishing this that would also have a high level of security?
Hi all - I recently learned that I could load a CSV or flat file to Netezza using an external table, which is really helpful when the data warehouse isn't specified to your needs. Even though I was super thrilled with this discovery, I would now like to load a Power Query table without having to export to a flat file. Wondering if anyone has done anything like this! I was thinking that I would need to reference the table in Power Query through some sort of address string.
Edit 1: Apparently I had missed this and related posts indicating the number of Swedish Apes was some 40 thousand, and the German sub is 13 thousand. OK, that's a facepalm for sure. Adjust all estimated numbers upwards by a minimum of 4 times to maintain the proportionality. I will keep the numbers as originally written, however. The main point of the 1st and 2nd sections was to establish a minimum viable estimate, because I think some of the numbers coming out of surveys just don't work. Even if the numbers are increased 10 or 20 times, the survey numbers I am critiquing still don't work in global proportions.
Edit 2: So on top of accidentally massively undercounting our dear Europoors because I hadn't seen the Swedish data, it appears we have a complete mess in using Bloomberg data, which I didn't know about. Apparently Bloomberg reports solely on the basis of institutional data and doesn't show individual retail. Yet the institutions of banks, brokers, hedge funds, pensions, retirements, etc. are composed of enormous numbers of retail investors, directly or indirectly. How this would be reported, and whether it is even possible for retail to be separated out of institutions, I don't know or understand. Another really important comment I have seen is that international investors (if they're reported at all) could be getting reported as Americans if the end point of their trade is an American institution such as Bank of New York Mellon, as many international brokers operate partnerships for access to American markets through American institutions.
This could be a major factor confounding the count of Europeans and others.
The post's first and second sections will remain unaltered below, as I don't know what or how to correct. The numbers are a mess to try and work with; informational asymmetry is real. Just remember that even with fucked-up undercountings of minimum estimates, we still own the float by a lot.
I consult for a number of companies all owned, at least in part, by the same person. The data (income statement, balance sheet, etc.) is stored in Excel files and various accounting software packages for each. Recently, the discussion came up that it would be great to bring all this information together in a single place, even if just the key line items are available at first.
At first, I thought a SQLite database with a few dimensions (like this) would do the trick, but I can't help but wonder whether I'm: 1. using the right tool for the job, and 2. thinking about this project the right way.
This would be my first experience with schema design/creation. Has anyone here had experience with a project like this?
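As a rough illustration of the kind of schema being considered, here is a minimal SQLite sketch with one fact table and a couple of dimensions; all table and column names are made up:

```python
# A minimal sketch of a small star schema in SQLite for consolidated financials.
# Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect("consolidated_financials.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS dim_company (
        company_id   INTEGER PRIMARY KEY,
        company_name TEXT NOT NULL
    );
    CREATE TABLE IF NOT EXISTS dim_account (
        account_id   INTEGER PRIMARY KEY,
        account_name TEXT NOT NULL,   -- e.g. 'Revenue', 'COGS'
        statement    TEXT NOT NULL    -- 'income_statement' or 'balance_sheet'
    );
    CREATE TABLE IF NOT EXISTS fact_financials (
        company_id INTEGER REFERENCES dim_company(company_id),
        account_id INTEGER REFERENCES dim_account(account_id),
        period     TEXT NOT NULL,     -- e.g. '2024-03'
        amount     REAL NOT NULL
    );
""")
conn.commit()
conn.close()
```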
Two years ago I switched from a data analyst to a broad BI developer role in our company. In that time I got very familiar with SQL/SSMS, SSIS, SSAS, SSRS and I know how to work with and expand the existing data warehouse at our company with new data sources and additional facts/dimensions. Very practical.
However, coming from an analyst/statistics background I feel I'm still missing a good understanding of data warehouse modeling concepts. Things like:
Normalization/denormalization
Kimball
Inmon
Data vault
Advantages/disadvantages of different data model schemas
Any tips for (types of) online courses I should consider to properly learn the concepts and theory of data warehousing and data modeling? What terms should I look for?
I feel I'm going to need to understand this stuff more thoroughly. Not just to deal with the impostor syndrome, but I also get the sense our only senior BI colleague is looking to go elsewhere...
Any suggestions would be appreciated!
I'm designing a data warehouse to hold survey data. It will hold various types of surveys each with different questions and answers. I'm trying to implement a conventional star schema but have some doubts regarding the fact table.
The two options are to go for what Kimball calls a measure type dimension (essentially the normalized version, with one column containing the questions and another the answers), or to fully denormalize the table and have a column for each question populated with the respective answers. If I go for the latter approach I might eventually end up with up to 10 fact tables (I was thinking one per survey type would be easier to manage in this case), each having 50-100 columns, vs a single fact table for the first.
The way I see it, the first approach has the main advantage that it's easier to maintain and has a fixed schema even if new questions are added, while the second would be better for validating the data, connecting to OLAP and visualisation tools in general, and would make it a lot easier to aggregate since each measure would have its own column. Lastly, the first will be tall and dense while the second approach would be very wide and sparse (1 measure or value per row at the limit).
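To make the two layouts concrete, here is a tiny pandas sketch (with made-up question names) that pivots the tall question/answer layout into the wide one-column-per-question layout:

```python
# A tiny sketch contrasting the two fact layouts; data and names are made up.
import pandas as pd

# Tall layout (measure type dimension): one row per answer.
tall = pd.DataFrame({
    "response_id": [1, 1, 2, 2],
    "question":    ["age", "satisfaction", "age", "satisfaction"],
    "answer":      [34, 4, 29, 5],
})

# Wide layout: one row per response, one column per question.
wide = tall.pivot(index="response_id", columns="question", values="answer")
print(wide)
```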
If you have any experience or have been burned by this in the past please share your thoughts. Which one would you go for and why?
I'd like to generate some discussion around the topic of these two architectures.
Reverse question: will modern data warehouses be able to take over data science?
The way I see it, products originally developed on top of data lake architectures, such as Databricks, Hudi, or Iceberg, and products originally developed as data warehouses, such as Snowflake and BigQuery, are all trying to converge on what we call the "Lakehouse" paradigm: combining data warehousing and data science workloads and ecosystems onto one platform.
With all of the new features introduced through Delta Lake and Spark 3.0 around transactionality, incrementality, query optimization, pruning etc., do you think it's ready to take over OLAP analytics and modeling? Why/why not?
I'm reading Kimball's data warehouse toolkit and I'm having a hard time understanding the need for a periodic snapshot table. It seems to me that we can just create it from the transactional table by running aggregations grouped by the interval of interest (daily, weekly, monthly, etc.).
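That intuition matches how a snapshot is usually built; here is a minimal pandas sketch (with hypothetical column names) that derives a monthly snapshot from a transactional fact, and a periodic snapshot table essentially persists something like this pre-computed result at the chosen grain:

```python
# A minimal sketch: derive a monthly snapshot from a transactional fact.
# Column names are hypothetical.
import pandas as pd

transactions = pd.DataFrame({
    "account_id": [1, 1, 1, 2],
    "txn_date":   pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03", "2024-01-10"]),
    "amount":     [100.0, -40.0, 25.0, 500.0],
})

# Aggregate transactions to a monthly grain...
monthly = (
    transactions
    .assign(month=transactions["txn_date"].dt.to_period("M"))
    .groupby(["account_id", "month"], as_index=False)["amount"].sum()
)
# ...then accumulate to get an end-of-month balance, a typical snapshot measure.
monthly["eom_balance"] = monthly.groupby("account_id")["amount"].cumsum()
print(monthly)
```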
Hi, in summary here's the situation: a large enterprise with terabytes of data on Teradata (heh). But as this is batch-run, it cannot be used for real-time information flowing from many source databases to web applications. There's ongoing work to CDC this by building an intermediate layer, converting it and making it match all of the different source warehouses (some of which predate Star Wars), which ultimately ends up in a DB2 on z/OS (mainframe).

An actual approach to data quality is pretty much nonexistent in this organization, so the general idea now (in the very short term) is to perform simple data profiling calculations on this DB2 (which is now the new 'single source of truth') and spit the results out to some reporting screen. Down the road, management is going to look at the whole circus (Apache Griffin etc.). Now I was told you cannot run these calculations every 5 minutes or so because it would use too many resources (average, median, length(field), min/max, quantile, group by and then all of the above, timediff of 2 columns, unique, frequency, count NULL, and perhaps some more). There are about 7-8 tables, 4-5 columns each, and let's say about a billion records.

On my personal computer I've done these simple calculations in Python (on a single processor) on large datasets (>2 GB) in a few seconds at most, albeit just on a pandas dataframe (or two). So how would this clog up a data warehouse that is built for these kinds of tasks? How would I know if performing these calculations would be too much (without actually performing them; I don't have access)? I just want a general idea of whether these commands cause most systems to clog up at this number of records, or whether it really doesn't matter.
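For reference, this is roughly the kind of per-column profile being described, sketched in pandas with a made-up sample; in practice the same metrics would presumably be pushed down to DB2 as SQL rather than pulled into Python:

```python
# A rough sketch of the per-column profiling metrics mentioned above.
# Sample data and names are made up.
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for col in df.columns:
        s = df[col]
        numeric = pd.api.types.is_numeric_dtype(s)
        rows.append({
            "column": col,
            "count": len(s),
            "nulls": int(s.isna().sum()),
            "unique": int(s.nunique()),
            "min": s.min() if numeric else None,
            "max": s.max() if numeric else None,
            "mean": s.mean() if numeric else None,
            "median": s.median() if numeric else None,
        })
    return pd.DataFrame(rows)

sample = pd.DataFrame({"customer_id": [1, 2, None], "amount": [10.5, 20.0, 7.25]})
print(profile(sample))
```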
Me: A recently employed data scientist who was put on a data engineering task and has never seen his colleagues in real life.
I have worked with BI in the past, but mostly on the analytics side: querying data from a data warehouse, generating analyses and such. So I know the very basics of databases and I basically know how PostgreSQL works.
Now I'm at a company that has zero infrastructure for a data warehouse. People run things on their own computers and store them in Excel and CSV files in the cloud.
I want to build a data warehouse from scratch, beginning with getting a VM to run procedures and host the database, but I know nothing about this part. In fact, I know so little about the technical side that I don't even know what to search for, and I can't even elaborate further on which resources I need you to point me to. Someone asked me if I wanted to run a cycle server or not, and I don't even know if the answer is yes or no.
Can anyone point me to good resources to learn the technical side of building a DW?
How should it be made?