Hadoop MapReduce vs Apache Spark

Why use Hadoop MapReduce if you can use Apache Spark, call a map transformation, and then call reduce?

👍︎ 5
📅︎ Jan 12 2022
Sensex Log Data Processing (PDF File Processing in Apache Map Reduce) Project (Hadoop Project) projectsbasedlearning.com…
👍︎ 4
📅︎ Jan 13 2022
Hadoop - Apache Spark Help

I'm working on a textbook practice project for a class where I need to find the maximum hourly electricity consumption and the average daily electricity consumption. Here is some sample data (the ENERGY_READING column is cumulative); these are records in a text file:

LOG_ID HOUSE_ID CONDATE CONHOUR ENERGY_READING
185 1 2012-06-01 01:30:40 5200.51948
186 1 2012-06-01 01:30:50 5200.522288
187 1 2012-06-01 01:31:00 5200.525096

Here is what I have for the maximum hourly consumption, but I'm not sure how to take into account that the energy levels are cumulative, and I need to somehow also display the hour in the results.

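import java.util.Objects;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

String filepath = "/Sparkresults/MaximumConsumption";

// "some ip address" stands in for the real master URL
SparkConf conf = new SparkConf().setMaster(some ip address).setAppName("Maximum Hourly Consumption");

JavaSparkContext sc = new JavaSparkContext(conf);

JavaRDD<String> textFile = sc.textFile("/data/energydata");

// Remove header row
String header = textFile.first();
textFile = textFile.filter(row -> !Objects.equals(row, header));

// Key each record by HOUSE_ID (column 1) and keep the largest ENERGY_READING (column 4)
JavaPairRDD<String, Double> results = textFile.map(ln -> ln.split("\t"))
        .mapToPair(rec -> new Tuple2<>(rec[1], Double.parseDouble(rec[4])))
        .reduceByKey(Math::max);

results.saveAsTextFile(filepath);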

And I have no idea how to even begin the average.

Any help would be much appreciated.
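
One possible approach, offered as a sketch rather than a definitive answer: because ENERGY_READING is cumulative, the consumption within a period can be approximated as the largest reading minus the smallest reading in that period. Keying by house, date, and hour (instead of house alone) keeps the hour visible in the result. The example below is plain Java with no Spark dependency so the logic can be checked on its own. The class and method names are hypothetical, and it assumes whitespace-separated columns with the header row already removed. The merge lambda has the same shape as the function one would pass to reduceByKey on a pair RDD of {min, max} values.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch (no Spark) of the per-key logic; all names here are hypothetical.
final class EnergyConsumption {

    // Groups rows by key and keeps {min, max} of the cumulative ENERGY_READING per key.
    // Assumes columns: LOG_ID HOUSE_ID CONDATE CONHOUR ENERGY_READING, split on whitespace.
    static Map<String, double[]> readingRanges(List<String> rows, boolean byHour) {
        Map<String, double[]> ranges = new HashMap<>();
        for (String row : rows) {
            String[] rec = row.trim().split("\\s+");
            String key = rec[1] + " " + rec[2];          // HOUSE_ID and CONDATE
            if (byHour) {
                key += " " + rec[3].substring(0, 2);     // hour part of CONHOUR (HH:mm:ss)
            }
            double reading = Double.parseDouble(rec[4]);
            // Combine two {min, max} pairs, as a reduceByKey function would.
            ranges.merge(key, new double[] {reading, reading},
                    (a, b) -> new double[] {Math.min(a[0], b[0]), Math.max(a[1], b[1])});
        }
        return ranges;
    }

    // Returns the "house date hour" key with the largest consumption, plus that consumption.
    static Map.Entry<String, Double> maxHourly(List<String> rows) {
        Map.Entry<String, Double> best = null;
        for (Map.Entry<String, double[]> e : readingRanges(rows, true).entrySet()) {
            double used = e.getValue()[1] - e.getValue()[0];
            if (best == null || used > best.getValue()) {
                best = new AbstractMap.SimpleEntry<>(e.getKey(), used);
            }
        }
        return best;
    }

    // Averages the per-day consumption (max - min per house and date) over all house-days.
    static double averageDaily(List<String> rows) {
        return readingRanges(rows, false).values().stream()
                .mapToDouble(r -> r[1] - r[0])
                .average()
                .orElse(0.0);
    }

    public static void main(String[] args) {
        // The three sample records from the post above.
        List<String> rows = Arrays.asList(
                "185 1 2012-06-01 01:30:40 5200.51948",
                "186 1 2012-06-01 01:30:50 5200.522288",
                "187 1 2012-06-01 01:31:00 5200.525096");
        Map.Entry<String, Double> peak = maxHourly(rows);
        System.out.println(peak.getKey() + " -> " + peak.getValue());
        System.out.println("average daily: " + averageDaily(rows));
    }
}
```

Note that max minus min within an hour misses whatever was consumed between the last reading of the previous hour and the first reading of this one. If that matters, the difference between consecutive readings would need to be computed instead, which requires ordering the records per house.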

👍︎ 5
📅︎ Dec 01 2021
Please recommend good text-based learning resources for distributed systems, Hadoop, and Apache Spark using Scala?

I prefer to learn by reading, so I'm looking for text-based learning resources, something like the Mozilla guide for web developers. Please help.

👍︎ 3
📅︎ Sep 09 2021
Is it true that "Apache products, like Hadoop, Hive, and Spark, continue to decline in importance"?

https://www.infoworld.com/article/3615695/it-s-pythons-all-the-way-down.amp.html says

"Apache products, like Hadoop, Hive, and Spark, continue to decline in importance."

If that's true, why, and what is replacing Spark?

Thanks.

👍︎ 61
👤︎ u/timlee126
📅︎ Apr 25 2021
How important are Hadoop, Spark, NoSQL, Apache?

I often hear these technologies mentioned in the data science industry. How important are they to know compared with knowledge of statistics/ML algorithms and regular SQL/R/Python?

👍︎ 212
👤︎ u/blueest
📅︎ Jan 27 2021
Difference Between Hadoop and Apache Spark bigdatapath.wordpress.com…
👍︎ 2
📅︎ Sep 22 2021
James Serra's take on centralized vs. decentralized ownership, Uber's containerizing Apache Hadoop, LinkedIn's journey from the daily dashboard to enterprise-grade data pipeline, Alibaba Cloud's CDC analysis with Apache Flink & Apache Iceberg dataengineeringweekly.com…
👍︎ 12
👤︎ u/vananth22
📅︎ Jul 26 2021
What is the difference between Apache, Apache Spark, Apache Hadoop, Databricks, Palantir Foundry?

OK, so I have done a Udemy course on Spark (PySpark) and Databricks, I am reading books on Spark, and I have done countless Google searches. But I am still confused about what these different components are and how they work, and I cannot find a plain answer online.

I really hope someone will be able to help me with even the simplest of answers you could provide on these questions I have. They're really starting to confuse me and they feel like such basic things to know.

What I have read is that Apache is a web server that acts as a middleman between browsers and servers.

  • What does that mean for Apache Spark, Apache Hadoop, and all the other Apache things out there? Does the name Apache just mean they use the Apache web server?

I also didn't understand why Spark uses Hadoop's file system if they are considered competitors that do the same thing. Do they both do a similar thing and have the same goal?

Do you have to use Databricks to use Spark? Does Spark always use Databricks clusters, or can Spark be used in a different way, such as PySpark with another provider through AWS? What else can you use Spark with instead of Databricks, if that's possible?

Is Palantir Foundry the same type of product as Databricks? Do they both have the same purpose?

OK, these were my burning questions. I would be so grateful if someone could explain these for me; it would be a big help. Thank you!

👍︎ 9
📅︎ Apr 18 2021
We're releasing our optimized Docker images for Apache Spark. They contain Spark, Scala, Java, Hadoop, Python, and connectors to common data sources. We hope they'll save you from dependency hell :)! Read more at https://www.datamechanics.co/blog-post/optimized-spark-docker-images-now-available
👍︎ 5
📅︎ Apr 19 2021
Should I implement an org.apache.hadoop.fs.AbstractFileSystem or extend org.apache.hadoop.fs.FileSystem?

[Copy of SO question, seeking info from users]

We are implementing a Spark client for direct access to lakeFS. This is a Git-like (versioned) storage layer on top of some other object store. We would like our file system to supply Spark (and other Hadoop-based tools) the ability to handle URLs such as `lakefs://repo/branch/path/to/object`.

Package org.apache.hadoop.fs supplies both AbstractFileSystem and an older FileSystem (links intentionally point at docs for an older version of Hadoop, to include all of our users). Any recommendations for which to implement?

  1. Can Spark users reasonably expect current code to rely on being able to create a FileSystem specifically (the older type)? In particular, do you have (or know of) any code that relies on FileSystem?
  2. Is implementation of either of these options significantly more robust (or simple)?
  3. Can the same code support both types (presumably with different configuration options)?

Thanks for any pointers!

👍︎ 3
👤︎ u/ariels-r
📅︎ Mar 25 2021
Docker Cluster of Apache Spark 3.0.0 and Hadoop 3.2 github.com/kadnan/docker-…
👍︎ 33
👤︎ u/pknerd
📅︎ Jul 25 2020
An Easy Introduction to Apache Hadoop for Data Storage | iunera iunera.com/kraken/fabric/…
👍︎ 2
📅︎ Feb 02 2021
Found my old laptop, an Asus XA452L (i3 4030U, 10GB RAM, 750GB HDD). Turned it into my dev server. I'm developing some apps which require Apache Spark and Hadoop as a back end.
👍︎ 118
📅︎ Mar 01 2020
An Easy Introduction to Apache Hadoop for Data Storage | iunera iunera.com/kraken/fabric/…
👍︎ 4
👤︎ u/Timbo2020
📅︎ Feb 03 2021
Does migrating from on-prem Apache Hadoop to Amazon EMR make sense in terms of cost/utilization?

Hey folks,

I'm currently looking for/researching ways of making on-prem Apache Hadoop/Spark clusters more cost- and resource-efficient. A total noob here, but my findings now go like this:

- you should migrate to the cloud, be it as-is or with re-architecture

- you better migrate to Amazon EMR 'cause it offers low cost, flexibility, scalability, etc.

What are your thoughts on this? Any suggestions?

Also, I'd really appreciate some business (not technical) input on whitepapers, guides, etc. I could read to research the topic, to prove that my findings are legit. So far, I found a few webinars (like this one - https://provectus.com/hadoop-migration-webinar/ ) and some random figures at the Amazon EMR page ( https://aws.amazon.com/emr/ ), but I fear these are not enough.

Anyway, I'd appreciate your thoughts and ideas. Thanks!

👍︎ 7
📅︎ May 20 2020
Big Data & Hadoop: big data hadoop course, hadoop training, big data training online, hadoop course online, hadoop course, Apache Spark, apache spark training, apache spark courses, spark training online, spark online course, best big data hadoop spark training.
👍︎ 2
📅︎ Jan 27 2021
Apache Spark vs. Kubernetes vs. Hadoop/Yarn

Noob question. What is the difference between:

  • Apache Spark
  • Kubernetes
  • Hadoop or Hadoop/Yarn
👍︎ 3
👤︎ u/leockl
📅︎ Aug 19 2020