big_data – Page 2 – Lester Martin (l11n)

develop, deploy, execute & monitor in one tool (welcome to apache nifi)

for those not familiar with apache nifi, come on a short overview of how this framework rather uniquely spans so many of the phases of the typical software development lifecycle

exploring ai data pipelines (hands-on with datavolo)

after explaining what rag ai apps are all about & showing what a typical ai data engineering pipeline looks like, i wanted to offer a hands-on lab exercise actually building a simple pipeline use datavolo cloud

understanding rag ai apps (and the pipelines that feed them)

i’m learning all about rag ai apps and wanted to try to explain, at a high-level, what these are all about plus do the same for the etl pipelines that are key to their success

iceberg snapshots affect storage footprint (not performance)

it is easy to understand why most folks initially imagine that iceberg’s ability to maintain a long history of snapshots will cause performance problems, but that is not the case — the real gotcha is that keeping many versions can quickly consume 2-10+ times the amount of data lake storage space

well designed partitions aid iceberg compaction (call them ice cubes)

despite what you may have heard, partitions are not dead (yes, there are multiple tools in the shed) and using a well-defined partitioning strategy with apache iceberg can help prevent concurrency issues when compacting files

reasons to avoid apache iceberg (clickbait)

wrapper post for two starburst deliverables (webinar & blog post) discussing why you should, or maybe shouldn’t, move from apache hive to apache iceberg for your data lake table format

iceberg materialized views in galaxy (no más storage_schema)

starburst galaxy, as a saas offering, just keeps slipping in nice bits of features & functionality — this one tackles hiding the underlying storage table of an iceberg materialized view

recap of the inaugural iceberg summit (my top 5 observations)

tl;dr – iceberg is pervasive, the real fight is for the catalog, concurrent transactional writes are a bitch, append-only tables still rule, and trino is widely adopted

joining spark dataframes with identical column names (an easier way)

presenting an easier solution to the problem of colliding column names when joining spark dataframes than i previously offered in my most popular post that just happens to be four years old — some things do age well

pystarburst in 90 seconds (try it)

still thinking about trying to get a pystarburst code stub up/n/running? starburst galaxy makes it pain free and you can even get your first dataframe created via python in under 90 seconds — why not give it a try?