Merge pull request #1 from scronge/add-spark-learnxinym-guide-scronge

[spark/en] Add Learn Spark in Y Minutes guide
This commit is contained in:
scronge 2024-11-09 13:46:19 -06:00 committed by GitHub
commit 40ca2fa049

spark.html.markdown Normal file

@@ -0,0 +1,63 @@
---
language: Spark
category: tool
tool: Spark
filename: learnspark.py
contributors:
- ["YourName", "https://github.com/Scronge"]
---

[Spark](https://spark.apache.org/) is an open-source distributed computing framework for large-scale data processing across clusters. This guide covers the basics of **Apache Spark** through PySpark, its Python API.

```python
# Setting Up Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("ExampleApp") \
    .getOrCreate()
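
# The builder accepts more options; a minimal sketch (the master URL and
# config key below are illustrative, adjust them for your environment):
# spark = SparkSession.builder \
#     .appName("ExampleApp") \
#     .master("local[*]") \
#     .config("spark.sql.shuffle.partitions", "8") \
#     .getOrCreate()
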
# Working with DataFrames
data = [("Alice", 30), ("Bob", 40)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()
# +-----+---+
# | Name|Age|
# +-----+---+
# |Alice| 30|
# | Bob| 40|
# +-----+---+
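
# Columns can be selected, derived, or renamed; a quick sketch on the df
# defined above (the new column names are just illustrative):
df.select("Name").show()
df.withColumn("AgeNextYear", df.Age + 1).show()
df.withColumnRenamed("Age", "Years").show()
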
# Transformations and Actions
df_filtered = df.filter(df.Age > 35)
df_filtered.show()
# +----+---+
# |Name|Age|
# +----+---+
# | Bob| 40|
# +----+---+
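
# Transformations like filter() are lazy; nothing executes until an action
# such as show(), collect(), or count() runs. A small aggregation sketch:
from pyspark.sql import functions as F

df.groupBy("Name").agg(F.max("Age").alias("MaxAge")).show()
df.orderBy(df.Age.desc()).show()
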
# SQL Queries
df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE Age > 30").show()
# Reading and Writing Files
csv_df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
df.write.parquet("output_path")
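
# Other formats and write modes work the same way; the paths below are
# placeholders:
json_df = spark.read.json("path/to/file.json")
df.write.mode("overwrite").csv("csv_output_path", header=True)
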
# RDD Basics
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
squared_rdd = rdd.map(lambda x: x ** 2)
print(squared_rdd.collect())
# Output: [1, 4, 9, 16]
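
# Other common RDD operations, shown on the same rdd:
even_rdd = rdd.filter(lambda x: x % 2 == 0)
print(even_rdd.collect())
# Output: [2, 4]
total = rdd.reduce(lambda a, b: a + b)
print(total)
# Output: 10
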
# Ending the Spark Session
spark.stop()
```