Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.
Document loaders
PySpark
It loads data from a PySpark DataFrame.
See a usage example.
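A minimal sketch of loading documents from a PySpark DataFrame with the PySparkDataFrameLoader from langchain_community. The example data, column names, and app name are assumptions; point page_content_column at whichever column of your own DataFrame holds the text.

```python
from pyspark.sql import SparkSession
from langchain_community.document_loaders import PySparkDataFrameLoader

# Assumed local SparkSession; reuse your existing session if you have one.
spark = SparkSession.builder.appName("langchain-example").getOrCreate()

# Hypothetical example data: each row becomes one Document.
df = spark.createDataFrame(
    [
        ("Spark is a unified analytics engine.", "intro"),
        ("Structured Streaming handles stream processing.", "streaming"),
    ],
    ["text", "topic"],
)

# page_content_column selects the column used as the Document text;
# the remaining columns end up in the Document metadata.
loader = PySparkDataFrameLoader(spark, df, page_content_column="text")
docs = loader.load()
print(docs[0].page_content)
```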
Tools/Toolkits
Spark SQL toolkit
Toolkit for interacting with Spark SQL.
See a usage example.
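A minimal sketch of wiring the toolkit into an agent, assuming a SparkSession with a table available in the default schema and an OpenAI chat model (the people table, the schema name, and ChatOpenAI are assumptions; any chat model works).

```python
from pyspark.sql import SparkSession
from langchain_community.agent_toolkits import SparkSQLToolkit, create_spark_sql_agent
from langchain_community.utilities.spark_sql import SparkSQL
from langchain_openai import ChatOpenAI

spark = SparkSession.builder.appName("langchain-example").getOrCreate()

# Hypothetical table registered for the agent to query.
spark.createDataFrame(
    [("Alice", 34), ("Bob", 29)], ["name", "age"]
).createOrReplaceTempView("people")

llm = ChatOpenAI(temperature=0)
spark_sql = SparkSQL(schema="default")  # wraps the active SparkSession
toolkit = SparkSQLToolkit(db=spark_sql, llm=llm)
agent_executor = create_spark_sql_agent(llm=llm, toolkit=toolkit, verbose=True)

agent_executor.run("Describe the people table")
```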
Spark SQL individual tools
You can use individual tools from the Spark SQL Toolkit:
- InfoSparkSQLTool: tool for getting metadata about a Spark SQL database
- ListSparkSQLTool: tool for getting table names
- QueryCheckerTool: tool that uses an LLM to check whether a query is correct
- QuerySparkSQLTool: tool for querying a Spark SQL database
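A minimal sketch of using the tools individually rather than through the toolkit, assuming the same SparkSQL wrapper and chat model as above and a hypothetical people table; the import path is the spark_sql tool module in langchain_community.

```python
from langchain_community.tools.spark_sql.tool import (
    InfoSparkSQLTool,
    ListSparkSQLTool,
    QueryCheckerTool,
    QuerySparkSQLTool,
)
from langchain_community.utilities.spark_sql import SparkSQL
from langchain_openai import ChatOpenAI

db = SparkSQL(schema="default")
llm = ChatOpenAI(temperature=0)

list_tool = ListSparkSQLTool(db=db)
info_tool = InfoSparkSQLTool(db=db)
query_tool = QuerySparkSQLTool(db=db)
checker_tool = QueryCheckerTool(db=db, llm=llm)

print(list_tool.run(""))                           # comma-separated table names
print(info_tool.run("people"))                     # schema and sample rows for the table
print(checker_tool.run("SELECT * FROM people"))    # LLM-reviewed version of the query
print(query_tool.run("SELECT * FROM people LIMIT 5"))
```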