Hello everyone,
I'm working on a project with a web app in Flask (a dashboard) that takes a hashtag from the user and then connects to the Twitter Streaming API to obtain, in real time, all the tweets being generated with that hashtag. They are then sent to a PySpark Streaming service to do certain operations on them, and finally the results are returned to the dashboard client.
So far, I've been using Python sockets for all the communication (including threading for sending/receiving the tweets and so on), but I would like to know if there are other, more production-ready tools I might be missing that could make the project more robust and practical (considering that multiple clients are expected to use the service at the same time).
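For example, one direction I've come across is putting a message broker between the collector and Spark instead of raw sockets. A minimal sketch of what that might look like, assuming Kafka and a made-up "tweets" topic (kafka-python on the Flask side, the spark-sql-kafka connector on the Spark side):

    # --- Flask / tweet-collector process: publish each incoming tweet ---
    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda d: json.dumps(d).encode("utf-8"),
    )

    def publish_tweet(tweet: dict) -> None:
        # "tweets" is a hypothetical topic name
        producer.send("tweets", tweet)

    # --- Spark job (separate process): consume the same topic ---
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("tweet-stream").getOrCreate()

    tweets = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "tweets")
        .load()
        .selectExpr("CAST(value AS STRING) AS tweet_json")
    )

    # Results could be written back to another topic that the dashboard subscribes to;
    # here they just go to the console for the sketch.
    query = tweets.writeStream.format("console").start()

That way the broker handles fan-out to multiple clients instead of hand-rolled socket threads, but I'd like to hear what people actually use.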
Thanks!
Hello Guys & Gals, Hope everyone is doing well,
The reason I'm raising this question is that we are redesigning some of our applications based on a microservices approach, but we are struggling with how to share data between different services where each service has its own database. To sum it up in a picture, we are implementing the system below:
Figure 1 - Database per service
For instance, we have a table in Service B called "orders", and this table has a column called "user_id" which points to the end user who submitted the order. We also have a table in Service A called "users" which stores the user credentials along with other related data. The challenge here is to show the username alongside its orders in Service B.
In a simple application with a shared database, the query would have looked something like:
SELECT
    ORDERS.NAME,
    ORDERS.PRICE,
    USERS.NAME
FROM ORDERS
JOIN USERS ON USERS.ID = ORDERS.USER_ID
Now, since we have an individual database for each service, we can't do it like that anymore.
A naive solution to this issue is to collect the "user_id"s, send them all via an HTTP request to Service A, fetch the corresponding usernames, and join the fetched data with the "orders" table in Service B.
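Roughly, that API-composition approach would look something like this (just a sketch; the endpoint name and response shape are assumptions):

    import requests

    def orders_with_usernames(orders):
        # Collect the distinct user_ids referenced by the orders from Service B.
        user_ids = sorted({o["user_id"] for o in orders})

        # Ask Service A for just those users ("/users?ids=..." is a hypothetical endpoint).
        resp = requests.get(
            "http://service-a/users",
            params={"ids": ",".join(str(i) for i in user_ids)},
            timeout=5,
        )
        resp.raise_for_status()
        users_by_id = {u["id"]: u for u in resp.json()}

        # Join in memory instead of in SQL.
        return [
            {**o, "username": users_by_id[o["user_id"]]["name"]}
            for o in orders
        ]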
The above is a basic example of what we may face during development, but it can get into much more complicated scenarios when other services are involved (a complex service like a CRM, for instance).
We found some solutions like:
In "Shared Database" and "Database Replication" architectures, although the implementation would be much easier but we will lose the significant benefit of Microservices which is "loosely coupling" like if we change the structure of one table in our database we have to change our code base in other services accordingly, along with not being able to use different database types as well.
What are your thoughts?
Is there any final and definitive solution to the above issues?
Any books, suggestions to help us deal with the problem would be much appreciated.
Thanks in advance.
I know this is a loaded question, but it's my first project using MongoDB and I am overwhelmed with how to handle the database portion. Not sure if this type of question is allowed, but any help or resources are appreciated.
Data
Any idea how best to organize this? Do I use a separate DB for each doctor's office? Do I create a collection for "patients" and a new collection for each and every test? How do I relate the tests to a patient?
I was thinking the following:
Any recommendations are appreciated.
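Something along these lines is roughly what I was imagining: one "patients" collection and one "tests" collection that references the patient, with the office as a field rather than a separate database. A minimal pymongo sketch (collection and field names are placeholders):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["clinic"]

    # One document per patient; the office is just a field, not a separate database.
    patient_id = db.patients.insert_one({
        "name": "Jane Doe",
        "office_id": "office_17",
    }).inserted_id

    # Tests live in their own collection and reference the patient, so a patient's
    # history can grow without hitting the 16 MB document size limit.
    db.tests.insert_one({
        "patient_id": patient_id,
        "type": "blood_panel",
        "taken_at": "2021-11-09",
        "results": {"glucose": 92, "hdl": 55},
    })

    # All tests for one patient:
    tests = list(db.tests.find({"patient_id": patient_id}))

Is that a reasonable direction, or should tests be embedded in the patient document instead?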
So I'm studying for the GCP PDE exam and don't have a huge amount of hands-on DE experience. I finished the 6-part Coursera course that's mostly aimed at preparing a DE for working within the GCP ecosystem. It's been helpful, but I've been going through question banks/practice tests and realized I'm missing a fair bit of the hands-on knowledge about migrating from legacy setups and how to properly troubleshoot things when standing up a new pipeline/tool.
I know a lot of this comes down to hours spent on the job, but do any of you have recommendations on resources that were helpful for things like going from Hadoop to Spark workflows, updating legacy SQL code, when to push or pull new data versus using a publish/subscribe model, etc.? Basically looking for resources on data architecture and common troubleshooting.
I'm currently a few chapters into Designing Data Intensive Applications and it's been great at helping me build a mental model of how some of these tools work but so far it seems a bit general. Also this sub has been great too for exposing me to new topics and y'all are great!
TLDR: Looking for resources on data architecture, troubleshooting and hands on stuff. GCP specific would be nice but AWS or open source works too. Halp plz!
Hi everyone,
Wanted to get your opinion on how to best manage ETLs at my current job.
Basically we have a not-so-bad architecture of dozens of microservices, all deployed to AWS, with infrastructure managed by Terraform.
But when it comes to the ETLs, what we do is very cumbersome:
Most of the time this is basically fetching and joining data, which could easily be done in any DWH or even a single database (our data is ~2 TB at most), but we have everything spread across dozens of Postgres databases. Most of the jobs are batch ETLs. Then with that data we do something (it can be pushing to some other database, running some ML models, or maybe sending emails, standard stuff).
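For context, the direction I'm tempted to go is landing everything in one warehouse and letting a scheduler run the batch jobs. A rough sketch of what one job might look like as an Airflow DAG (connection handling, table names and the schedule are just placeholders, not what we run today):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract_and_load(**context):
        # Pull rows from each source Postgres DB and land them in the warehouse.
        # Connection handling omitted; "orders_db" / "users_db" would be the sources.
        ...

    def run_join(**context):
        # Run the join/aggregation inside the warehouse instead of in application code.
        ...

    with DAG(
        dag_id="nightly_orders_report",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        load = PythonOperator(task_id="extract_and_load", python_callable=extract_and_load)
        join = PythonOperator(task_id="run_join", python_callable=run_join)
        load >> join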
So now there are two things:
Let me know what you think is the way to go here.
Hello
Could someone please share insights on which distributed data store implementation to choose for learning the architecture concepts in depth?
The list consists of Apache Cassandra, MongoDB, Amazon DynamoDB.
I'd like to hear why you would consider one over the others when it comes to learning the concepts and architecture. My hypothesis is that all three share a lot of common architectural characteristics.
Please share your thoughts. Thanks
Hey everyone.
We wrote a post, as part of a series of posts, about our data stack. It includes some dogfooding for obvious reasons and hopefully not too much of the inevitable bias. But I believe it will be helpful for anyone who's interested in designing/implementing a data stack.
You can find the post here: https://rudderstack.com/blog/rudderstacks-data-stack-deep-dive
Feedback is more than welcome and happy to answer any questions.
https://www.turing.ac.uk/research/research-projects/turing-benchmarking-framework
Project Status: Finished.
Data structures
Numerical (Algorithms)
Software framework development
Hardware optimisation (FPGA/GPU)
Statistical methods & theory
Optimisation
Computing networks
Parallel computing
Neural networks
Machine learning
Looking for some real-world experience in implementing a new Veeam solution.
Do you use a proxy? A proxy VM/physical machine with a VM/physical backup server?
Put it all on one physical machine?
I've read that the backup server should be at the DR site. Has anyone used a VM backup server and replicated it to the DR site?
Love to hear people's suggestions/anecdotes.
If you could build a real-time data architecture from scratch, what tools would you use?
Context:
Needs:
I am currently a college student. I have a course that will have me do ETL and some data architecture. I am not really aware of what ETL or data architecture is and want some information on it. Is ETL or data architecture the most difficult concept to understand in the data science field? If you need a comparison, would it be more difficult to understand those concepts than learning business analysis? Is data architecture anything like regular architecture?
Hello everyone!
As the title says, I would like to know your process. The reason for this is that I am still a bit lost when it comes to designing solutions. There is a lot of information out there and it seems like there are different ways to create a solution.
How do you decide whether the data should go into a lake or a warehouse?
Is the data architecture oriented around the startup's growth model? e.g. product-led, marketing-led, etc.
https://preview.redd.it/lhzm7w9ca1z71.png?width=1024&format=png&auto=webp&s=12d26c6d76aa5a1e2321ceacf37ddf728d26db8b
For example, in this architecture we load raw data into a data lake and then into a warehouse, but I have heard that it is better to load the data into a lake and transform it there without having to move it to a warehouse. In what situations is it better to load the data into a warehouse rather than a lake?
Another thing I want to ask is why many people build pipelines with Python when there are solutions like Airbyte that allow you to automate this. Is there something I'm not seeing?
Sorry for my bad English and I hope the point of this post is clear.
So Jensen presented his keynote today and fools are quick to sell AMD's shares because of it. Jensen was using the phrase "Accelerated Datacenter"; see the title of NVIDIA's official post linked below!
https://blogs.nvidia.com/blog/2021/11/09/nvidia-ceo-accelerated-computing-ai-omniverse-avatars-robots-gtc/
But it's just words, a "vision", the "future"... Does he have an answer NOW to the CDNA 2 MI200 datacenter GPUs? No! Does he have a way to use EPYC Milan-X or the future Genoa to connect his GPUs COHERENTLY? No. Infinity Fabric is only available to connect AMD's own chips and chiplets!
Enough said!
I am looking for software or a method for generating architectural diagrams (system diagrams, component diagrams) from structured information about systems and their relationships. Let me explain.
Let's imagine an enterprise context with hundreds of components shared between dozens of departments and projects.
The final objective is to generate a diagram of the current architecture (dedicated to the specific use-case) from the filters applied, for example:
Do you use or know of anything like this?
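To make it concrete, this is roughly what I imagine the tool doing, sketched here with the Python graphviz package (the component data and the filter are placeholders; in reality the structured input would come from a CMDB, a YAML file, or an architecture repository):

    from graphviz import Digraph

    # Structured description of systems and their relationships.
    components = [
        {"name": "CRM", "department": "Sales"},
        {"name": "Billing", "department": "Finance"},
        {"name": "Data Warehouse", "department": "IT"},
    ]
    relations = [
        ("CRM", "Data Warehouse"),
        ("Billing", "Data Warehouse"),
    ]

    def render(department_filter=None):
        dot = Digraph("current_architecture", format="png")
        kept = set()
        for c in components:
            if department_filter and c["department"] != department_filter:
                continue
            dot.node(c["name"], label=f"{c['name']}\n({c['department']})")
            kept.add(c["name"])
        for src, dst in relations:
            if src in kept and dst in kept:
                dot.edge(src, dst)
        dot.render("architecture", cleanup=True)

    render()           # full diagram
    render("Sales")    # filtered, use-case-specific view

Ideally I would not hand-roll this, hence the question about existing tools.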
We have several clients for whom we need to create a BI tool. It looks like a good opportunity to look ahead and try to build a SaaS tool, since we already have the clients to pilot with.
The idea at its simplest: a client connects their data source and our tool shows a dashboard with insights. We already have an MVP which our clients like, but it works only with manually uploaded CSV files with small amounts of data. Our next step is to design a scalable data architecture. We have very little hands-on experience with this, so currently we are just doing some research.
The problem we see is that we do not know the data schemas a priori. Each client has a different dataset with a different schema. So we cannot simply design a normalized data model for fast ad-hoc queries that will work for everybody. Or can we? What are some good approaches to this?
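One idea we are toying with (an assumption on our part, not something we've validated) is landing every client's rows in a semi-structured column and only defining per-client views on top, for example with Postgres JSONB:

    import psycopg2
    from psycopg2.extras import Json

    conn = psycopg2.connect("dbname=analytics")

    with conn, conn.cursor() as cur:
        # One generic table for every client; the schema differences live inside JSONB.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS client_rows (
                client_id  text NOT NULL,
                loaded_at  timestamptz NOT NULL DEFAULT now(),
                payload    jsonb NOT NULL
            )
        """)

        # Land a row exactly as the client sent it, whatever its shape.
        cur.execute(
            "INSERT INTO client_rows (client_id, payload) VALUES (%s, %s)",
            ("client_42", Json({"order_id": 1, "amount": 19.9})),
        )

        # Ad-hoc queries reach into the payload; per-client views could later pin down
        # the fields each dashboard actually needs.
        cur.execute(
            "SELECT payload->>'order_id' FROM client_rows WHERE client_id = %s",
            ("client_42",),
        )
        print(cur.fetchall())

No idea yet whether that scales to the query patterns we need, which is part of the question.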
Datasources may be:
Challenges we identified so far:
Forgive me if I use incorrect terminology, still learning.
Xmas shopping at a bookshop, in the architecture section was a book on data centres - WTF? It was shrink-wrapped, so I couldn't see what it included. The data centres I have seen are rectangular with hardly any windows, so I can't see the attraction.
Hello,
In my team, we currently have a monolith and want to move to a microservices architecture (for various reasons). One problem we face, however, is figuring out which service should store which data.
For the sake of the example, let's say that we have an entity which is a simple Word Document.
So we have one service for creating, editing, saving, deleting, etc. the Word document (among other things like listing and the other CRUD operations).
This entity has various states, such as
Now these states are reached through workflows. The workflow, however, can be configured for each user individually, so one user can move an entity from Draft to Open, while another can move it straight from Draft to Completed (skipping the Open/In Progress states). Also, when a specific state is reached, we send emails (through a message queue which is listened to by another microservice).
-----
So what we want to do is create a new microservice which saves the configuration of the workflow and actually performs the state changes. It has the ID of the entity and the workflow configuration, and then decides which state the entity should move to next.
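Just to make that tangible, the per-user workflow configuration boils down to something like this (user names, states and transitions are simplified for the example):

    # A workflow is just the allowed transitions for a given user.
    WORKFLOWS = {
        "user_a": {"Draft": ["Open"], "Open": ["In Progress"], "In Progress": ["Completed"]},
        "user_b": {"Draft": ["Completed"]},  # this user skips Open / In Progress
    }

    def next_states(user_id: str, current_state: str) -> list:
        """States the WorkflowService would allow the entity to move into next."""
        return WORKFLOWS.get(user_id, {}).get(current_state, [])

    def transition(user_id: str, entity_id: str, current_state: str, target: str) -> str:
        if target not in next_states(user_id, current_state):
            raise ValueError(f"{current_state} -> {target} is not allowed for {user_id}")
        # Here the service would persist the change and emit an event
        # (e.g. "document.state_changed") so the email microservice can react.
        return target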
I hope I gave you enough insight on how we want to build the microservice architecture. I know it could be quite hard to grasp the whole concept from this small example.
The problem we are facing however now, is which service should save the ACTUAL state of the entity.
So the WordDocumentService has its own database (for saving the data in general, such as the title, filename and content) and the WorkflowService has its own database (for saving the workflows and which entity has which workflow assigned).
So should the current state (Draft, Open, etc.) be saved
So in my opinion, everything related to that should
Currently in the process of building up my SaaS product and some users would like to have sample data so that they can explore the product right after signing up.
Now I'm using a microservices architecture where each service manages its own database, so I'd have to insert data on pretty much every service.
I thought of a few solutions:
Have every service deal with sample data itself, maybe an endpoint where you can insert it for the user (cons: high coupling if data is dependent on each other, changes to the data would be pretty hard)
Have a central service/function deal with managing the sample data; it will then just call every service and insert the required data (con: what if the interfaces of the services change?)
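For that second option, a rough sketch of what I mean (service URLs and payloads are made up):

    import requests

    # Ordered so that dependencies are created first (e.g. an item before an order that uses it).
    SAMPLE_DATA = [
        ("http://catalog-service/items",  {"name": "Sample item", "price": 9.99}),
        ("http://order-service/orders",   {"item_name": "Sample item", "quantity": 2}),
    ]

    def seed_sample_data(user_id: str) -> None:
        for url, payload in SAMPLE_DATA:
            resp = requests.post(url, json={**payload, "user_id": user_id}, timeout=5)
            resp.raise_for_status()  # fail fast if one service's interface changed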
What do you guys think? Are there any existing solutions or patterns for this problem? Quite a lot of products offer this feature, but I was not able to find much on it.
Hello All. I am currently working for a retail analytics company as a Data Engineer + Data Scientist. As of now I have built a simple alerting system. It checks for the alert subscriber and alert type during change calculation in the ETL (for example a price change) and sends an email if a match is found. It's done via Python scripts with a somewhat reconfigurable alert sink (SQS, email, etc.) and alert data (the count of products with a change, or the full product list). However, it's time to make it more scalable (up to 1000 alerts per minute) and separate it from the main ETL pipeline. I always prefer a simple solution. Any suggestion on any aspect of the process is welcome. Thank you.
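Since I already use SQS as a sink, the decoupled shape I'm considering looks roughly like this (queue URL and message shape are placeholders): the ETL only publishes alert events, and a separate worker drains the queue and does the sending.

    import json
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/alerts"  # placeholder

    def publish_alert(subscriber: str, alert_type: str, payload: dict) -> None:
        """Called from the ETL when a change matches a subscription."""
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps(
                {"subscriber": subscriber, "type": alert_type, "payload": payload}
            ),
        )

    def drain_alerts() -> None:
        """Separate worker process (or Lambda): receives alerts and sends the emails."""
        while True:
            resp = sqs.receive_message(
                QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
            )
            for msg in resp.get("Messages", []):
                alert = json.loads(msg["Body"])
                # send_email(alert) would go here
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

Happy to hear whether that's over- or under-engineered for ~1000 alerts per minute.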
I started a new job as a data engineer this week; before that I was a data analyst. I have a strong technical background*, but I have very little hands-on experience with data engineering.
I am supposed to design a data architecture for a product analytics tool (SaaS) similar to Amplitude or Fivetran. So the vision is to build a tool with automated data integration, support for adhoc querying, dashboards, etc. **
Where should I start?
I do not know what questions I should consider, what components of the stack I should research, etc. I started to do some research today; you will laugh, but the only thing I was able to do was compare Elasticsearch and TimescaleDB, and I found SingleStore, which looks like something I should give my attention to, but I do not know what I should do next.
I will appreciate all kinds of advice.
___
edit:
* According to the Dataengineer wiki, I would be somewhere in between a data engineer and a senior data engineer, which is probably funny since I said that I have little hands-on experience. What I meant is that I have never designed or discussed data architecture, but I have quite rich experience with full-stack software development (mostly web apps) and some experience with data science.
** Our customers will be from different business domains with datasets of tens to hundreds of GB (custom schemas). There will be common analytic use cases, but each customer will also have their own unique ones.
I am currently a college student. I have a course that will have me do ETL and some data architecture. I am not really aware of what ETL or data architecture is and want some information on it. Is ETL or data architecture the most difficult concept to understand in the data science field? If you need a comparison, would it be more difficult to understand those concepts than learning data analytics or database management systems? Is data architecture anything like regular architecture?