• 1 Post
  • 27 Comments
Joined 2 years ago
Cake day: June 12th, 2023


  • Well, indeed the devil’s in the detail.

    But going with your story: yes, you are right in general. But the human input is already there.

    But you have to have human-made material to train the classifier, and if the classifier doesn’t improve, then the generator never does either.

    AI can already understand what stripes are, and can draw the connection that a zebra is a horse with stripes. Therefore the human input is already given. Brute-force learning will do the rest, simply because time is irrelevant and computations occur at a much faster rate.

    Therefore I believe that in the future AI will enhance itself, because the input it already got is sufficient to hone its skills.

    I know that for now we are just talking about LLMs as black boxes which are repetitive in generating output (no creativity). But in this sense the 2nd grader also has many skills which are sufficient to enlarge their knowledge, without requiring everything to be taught by a human.

    I simply doubt this:

    LLMs will get progressively less useful

    Where will it get data about new programming languages or solutions to problems in new software?

    On the other hand, you are right: AI will not understand abstractions of something beyond its realm. But this does not mean it won’t rapidly improve at the things it can draw conclusions from.

    And even in the case of new programming languages, I think a trained model will pick up the logic of the code - basically making use of its already learned pattern recognition skills. And probably at a faster pace than a human can understand a new programming language.







  • I could not comprehend what you were trying to tell us.

    But the summary is:

    The key essence of this post is a deeply disillusioned and angry critique of modern American society, government, and technology. The author expresses a sense of frustration with the perceived emptiness, manipulation, and decay of U.S. institutions—seeing democracy as a facade, tech innovation as overhyped and hollow, and the government as ineffective. They convey a desire for systemic collapse or radical upheaval (accelerationism), suggesting that elites will soon resort to authoritarianism to maintain control. There’s also an undercurrent of socio-political pessimism, nihilism, and rejection of both corporate and state power—coupled with a belief that the current system is unsustainable and nearing a breaking point.




  • BTW, your hard disks are going to be your bottleneck unless you’re reaching out over the internet, so your best bet is to move that data onto an NVMe SSD. That’ll blow any other suggestion I have out of the water.

    Yes, we are currently in the process of migrating to PostgreSQL and to new hardware. Nonetheless, the approach we are using is a disaster, so we will refactor it as well. Appreciate your input.

    I don’t know what language you’re working in.

    All processing and SQL-related transactions are executed via Python. But that should not have any influence, since the SQL server is the bottleneck.

    WITH (NOLOCK)

    Yes, I have already considered this for the next update, since our setup can accept dirty reads - but I have not tested/quantified any benefits yet.
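
    Roughly what I have in mind for the next update (table and column names are placeholders, not our real schema):

        -- T-SQL: read without taking shared locks; dirty reads are acceptable in our setup
        SELECT product_id, title
        FROM dbo.products WITH (NOLOCK)
        WHERE country = 'DE';

        -- per-session alternative instead of a per-table hint
        SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

    (As far as I know the hint does not exist in PostgreSQL; there, readers do not block writers anyway because of MVCC.)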

    Don’t do a write and a read at the same time since you’re on HDDs.

    While I understand the underlying issue here, I do not yet know how to control this, since we have multiple microservices connected to the DB which either fetch (read), write or delete from different tables. But to my understanding, since I am currently not using NOLOCK, such occurrences should be handled by SQL Server, no? What I mean is that during a process the object is locked, so no other process can interfere with the SQL object?
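
    One option I have seen mentioned for SQL Server (an assumption on my side, not tested) is snapshot isolation, so reads see the last committed version instead of waiting on writers' locks - without accepting dirty reads like NOLOCK does:

        -- database name is a placeholder
        ALTER DATABASE our_db SET READ_COMMITTED_SNAPSHOT ON;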

    Thanks for putting this together, I will review it again tomorrow (Y).





  • broadly you want to locate individual records as quickly as possible by using the most selective criteria

    What can be more selective than "ID = 'XXX'"? Yet the whole table still has to be scanned until XXX is found?
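
    If I understand the replies correctly, that is exactly what an index changes - a rough sketch with placeholder names, assuming ID is not already the primary key:

        CREATE INDEX ix_products_id ON products (id);

        -- with the index this is an index seek; without it, a full table scan
        SELECT *
        FROM products
        WHERE id = 'XXX';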

    … and to familiarize yourself with normalization.

    Based on a quick review of normalization, I doubt that this helps me, as we do not have such links in the data. We “simply” have many products with certain parameters (title, description, etc.), and based on those we process each product and store it with additional output in a table. However, to not process products which were already processed, we want to dismiss any product in the processing pipeline that is already stored in the “final” table.
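
    In SQL terms, what we want is probably something like this (placeholder names; the filtering then happens inside the DB instead of in our Python code):

        -- products still to be processed, skipping anything already in the final table
        SELECT p.product_id, p.title, p.description
        FROM pipeline_products AS p
        WHERE NOT EXISTS (
            SELECT 1
            FROM final_products AS f
            WHERE f.product_id = p.product_id
        );

    With an index on final_products(product_id) this becomes an indexed anti-join rather than a comparison against the whole table.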

    It isn’t just a big bucket to throw data into to retrieve later.

    That’s probably the biggest enlightenment I have gotten since we started working with a database.

    Anyway, I appreciate your input, so thank you for this.


  • A hot DB should not run on HDDs. Slap some nvme storage into that server if you can. If you can’t, consider getting a new server and migrating to it.

    Did this because of the convincing replies in this thread. We are migrating to modern hardware and switching from SQL Server to PostgreSQL (because it’s already used by the other system we work with, and there is know-how available in this domain).

    You should avoid scanning an entire table with a huge number of rows when possible, at least during requests.

    But how can we then ensure that we are not adding/processing products which are already in the “final” table, when we have no knowledge about ALL the products which are in this final table?

    Create an index and a table constraint on the relevant columns. … just so that the DB can do the work for you. The DB is better at enforcing constraints than you are (when it can do so).

    This is helpful and also matches what I experienced. At the peak of the overload, the CPU load was pretty much zero - all the load was disk read/write, which was because of our poor query design/architecture.
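
    If I read this correctly, for our dedupe case that would look roughly like this on PostgreSQL (placeholder names, untested):

        ALTER TABLE final_products
            ADD CONSTRAINT uq_final_products_product_id UNIQUE (product_id);

        -- duplicates are skipped via the unique index, no full-table comparison on our side
        INSERT INTO final_products (product_id, title, output)
        SELECT product_id, title, output
        FROM staging_products
        ON CONFLICT (product_id) DO NOTHING;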

    For read-heavy workflows, consider whether caches or read replicas will benefit you.

    Could you elaborate on what you mean by read replicas? Storage in memory?

    And finally back to my first point: read. Learn. There are no shortcuts. You cannot get better at something if you don’t take the time to educate yourself on it.

    Yes, I will swallow the pill. But thanks to the replies here I now have many starting points.

    RTFM is nice - but starting with page 0 is overwhelming.


  • To paraquote H. L. Mencken: For every problem, there is a solution that’s cheap, fast, easy to implement – and wrong.

    This can be the new slogan of our development. :')

    I have convinced management to switch to a modern server. In addition, we hope that refactoring our approach (no random reads, no dedupe processes over a whole table, etc.) will lead us somewhere.

    As for how to viably decrease the amount of data in your active set, well, that’s hard to say without knowledge of the data and what you want to do with it. Is it a historical dataset or time series?

    Actually, no. We are adding a layer of product processing to an already in-production system which handles multiple millions of products on a daily basis. Since we not only have to process the new/updated products but also have to catch up with processing the historical (older) products, it’s a massive amount of products. We thought that, since the order is not important, we would use a random approach to catch up. But I see now that this is a major bottleneck in our design.
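
    A rough sketch of what a non-random catch-up could look like instead (placeholder names, just an idea): walk the backlog in key order, in fixed-size batches, and remember where the last batch stopped.

        -- PostgreSQL; in T-SQL this would be SELECT TOP (1000) ... ORDER BY product_id
        SELECT product_id, title, description
        FROM products
        WHERE product_id > :last_processed_id   -- taken from the previous batch
        ORDER BY product_id
        LIMIT 1000;

    With an index on product_id each batch is a range seek, so the reads stay mostly sequential instead of random.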

    If so, do you need to integrate the entire series back until the dawn of time, or can you narrow the focus to a recent time window and shunt old data off to cold storage?

    So no, no narrowing.

    Is all the data per sample required at all times, or can details that are only seldom needed be split off into separate detail tables that can be stored on separate physical drives at least?

    Also no, IMO. Since we don’t want a product to be processed twice, we want to ensure deduplication - and this requires knowledge of all already processed products, therefore comparing with the whole table every time.
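
    Although, if I understood the other replies correctly, with an index on the product ID that comparison never has to touch the whole table - a single probe per candidate (placeholder names) should be enough:

        -- returns 0 or 1 rows; with an index on product_id this is one index lookup
        SELECT 1
        FROM final_products
        WHERE product_id = :candidate_id;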


  • First question: how many separate tables does your DB have? If less than say 20, you are probably in simple territory.

    Currently about ~50. But around 30 of them are the result of splitting one table by a common column like “country”. In the beginning I assumed this leads to the same result as partitioning one large table?
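
    For the PostgreSQL migration, the built-in way to get that effect seems to be declarative partitioning instead of hand-split tables - a sketch with placeholder columns:

        CREATE TABLE products (
            product_id  bigint NOT NULL,
            country     text   NOT NULL,
            title       text,
            description text
        ) PARTITION BY LIST (country);

        CREATE TABLE products_de PARTITION OF products FOR VALUES IN ('DE');
        CREATE TABLE products_fr PARTITION OF products FOR VALUES IN ('FR');

    Queries that filter on country are then routed to the matching partition automatically, while everything stays one logical table.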

    Also, look at your slowest queries

    The individual queries themselves do not take long because of the query per se, but due to the limitation of the HDD: SQL Server reads as much as possible from disk to go through a table, and given that there are now multiple connections all querying multiple tables, this leads to a server overload. While I now see the issue with our approach, I hope that migrating from SQL Server to PostgreSQL and to modern hardware, plus refactoring our approach in general, will give us a boost.
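
    To verify which queries actually cause the table scans once we have migrated, something like this should show it (PostgreSQL, placeholder query):

        -- shows whether the plan is a sequential scan or an index scan, and how much I/O it needs
        EXPLAIN (ANALYZE, BUFFERS)
        SELECT product_id
        FROM final_products
        WHERE product_id = 12345;

    (On the current SQL Server, the actual execution plan or SET STATISTICS IO ON gives the same kind of information.)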

    They likely say SELECT something FROM this JOIN that JOIN otherthing bla bla bla. How many different JOINs are in that query?

    Actually, no JOINs. The most “complex” query is an INSERT INTO with a WHERE NOT EXISTS condition.

    But thank you for your advice. I will incorporate the tips in our new design approach.



  • First of all, many thanks for the bullets. Good to have some guidance on where to start.

    2nd level cache shared between services

    I have read about this in relation to how FB does it. In general this means fetching from the DB and keeping the result in memory to work with, right? So we accept that the cached data is outdated to some extent?

    faster storage/cpu/ram faster storage/cpu/ram faster storage/cpu/ram

    I was able to convince management to put money into a new server (SSD, thank god), so thank you for your emphasis. We are also migrating from SQL Server to PostgreSQL, and refactoring the whole approach and design in general.

    generate indexes

    How would indexes help me when I want to ensure that no duplicate row is added? Is this some sort of internal SQL constraint, or what is the difference compared to checking a certain list of rows against an existing table (let’s say on column id)?
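
    From what I gather so far, a unique index is exactly that kind of built-in constraint - a rough sketch with placeholder names (the skip-duplicates part is PostgreSQL syntax):

        CREATE UNIQUE INDEX ux_final_products_product_id ON final_products (product_id);

        -- every insert now probes that index; rows with an existing product_id are simply skipped
        INSERT INTO final_products (product_id, title)
        VALUES (:product_id, :title)
        ON CONFLICT (product_id) DO NOTHING;

    So instead of comparing a list of rows against the whole table myself, the engine does one index lookup per inserted row.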